Cluster analysis of the origins of the new influenza A(H1N1) virus.

In March and April 2009, a new strain of influenza A(H1N1) virus was isolated in Mexico and the United States. Since the initial reports, more than 10,000 cases have been reported to the World Health Organization from all around the world. Several hundred isolates have already been sequenced and deposited in public databases. We have studied the genetics of the new strain and identified its closest relatives through a cluster analysis approach. We show that the new virus combines genetic information related to different swine influenza viruses. Segments PB2, PB1, PA, HA, NP and NS are related to swine H1N2 and H3N2 influenza viruses isolated in North America. Segments NA and M are related to swine influenza viruses isolated in Eurasia.


Introduction
Influenza A virus is a single-stranded RNA virus with a segmented genome. When different influenza viruses co-infect the same cell, progeny viruses can be released that contain a novel mix of segments from both parental viruses. Since the first reported pandemic in 1918, there have been two other pandemics in the 20th century. In both cases, the pandemic strains presented a novel reassortment of genome segments derived from human and avian viruses [1][2][3]. The origins of the 1918 strain are not so clear, although different analyses suggest that this virus had an avian origin [4,5].
When and where pandemic reassortments happen remains a mystery. Avian viruses often undergo reassortment events among different subtypes. Several reports suggest that reassortments are also frequent between human viruses [6,7]. Co-infections are frequently found in swine, and reassortment of swine, human and avian viruses has been reported [8][9][10]. In addition, cell surface oligosaccharide receptors of the swine trachea present both an N-acetylneuraminic acid-alpha2,3-galactose (NeuAcalpha2,3Gal) linkage, preferred by most avian influenza viruses, and a NeuAcalpha2,6Gal linkage, preferred by human viruses [11]. Co-infection, combined with the co-habitation of swine and poultry on small family farms all over Asia and the presence of avian as well as human receptor types in pigs, has led to the "mixing vessel" conjecture [12,13], which suggests that most inter-host reassortments are produced in pigs.
Recently, a new A(H1N1) subtype strain was identified, initially in Mexico, and then rapidly reported on all continents. As of 27 May, 12,954 cases of the new influenza A(H1N1) virus infection, including 92 deaths, had been reported to the World Health Organization [14,15]. Several approaches have been used to understand the origins of this strain. Searches in public databases containing influenza A genomes using sequence alignment tools indicated that the closest relatives for each of the eight genomic segments are from viruses that have been circulating in swine for the past decade [16][17][18][19]. These include genome segments derived from "triple reassortant" swine viruses, which in the late 1990s combined genome segments from viruses previously identified in humans, birds and swine [20]. Similar conclusions were drawn from the application of phylogenetic techniques [16,21].
Here we present a cluster analysis using Principal Component Analysis and unsupervised clustering. Clustering methods are particularly robust under changes in the underlying evolutionary models. Our results substantiate previous reports [16,21] and demonstrate that, for each of the genome segments of the new influenza A(H1N1) virus, the closest relative was most recently identified in swine, which is compatible with a reassortment of Eurasian and North American swine viruses (Figure 1).

Materials and methods
Influenza sequences were obtained from the National Center for Biotechnology Information (NCBI) [22] in the United States. We performed a search using the Basic Local Alignment Search Tool (BLAST) for each of the eight A/California/04/2009(H1N1) segments separately, recording the 50 best matches for each. We then constructed the union of all these matches and took the sequences of all segments available in the database for each matching virus. We aligned these sequences using the stretcher algorithm as implemented in the EMBOSS package.
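The selection step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `union_of_top_hits`, the layout of `hits_by_segment` (segment name mapped to a list of (strain, score) pairs) and the strain names in the example are hypothetical, and the BLAST search itself is assumed to have been run beforehand.

```python
def union_of_top_hits(hits_by_segment, top_n=50):
    """Return the set of strains that appear among the top_n best
    matches of at least one segment.

    hits_by_segment is a hypothetical structure: a dict mapping a
    segment name (e.g. "HA") to a list of (strain_name, score) pairs,
    where a higher score means a better match.
    """
    strains = set()
    for hits in hits_by_segment.values():
        # Keep only the top_n best-scoring matches for this segment.
        best = sorted(hits, key=lambda hit: hit[1], reverse=True)[:top_n]
        strains.update(strain for strain, _score in best)
    return strains


# Toy example with made-up strain names and scores.
example_hits = {
    "HA": [("strain_X", 900), ("strain_Y", 800), ("strain_Z", 100)],
    "NA": [("strain_W", 950), ("strain_X", 700)],
}
selected = union_of_top_hits(example_hits, top_n=2)
```

All available segments of every strain in `selected` would then be retrieved and aligned to the reference.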
After the alignment, we translate the sequences into binary data by comparing them with the reference sequence site by site. A mutation maps to 1, while a nucleotide identical to that in the reference sequence maps to 0. Whenever there are masks, they map to the corresponding fractional numbers. Gaps are not counted as polymorphisms. Therefore, if there are S sequences restricted to P polymorphic sites, the data translate into an S×P matrix. Each row of this matrix represents one of the sequences and can be thought of as a vector in a P-dimensional space.
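A minimal sketch of this encoding is given below. It rests on assumptions that the text does not spell out: "masks" are taken to be IUPAC ambiguity codes, and the fractional value of a mask is taken to be the fraction of its candidate bases that differ from the reference base. The reference itself is assumed to contain only A, C, G, T or gaps, and the names `site_value` and `encode` are illustrative.

```python
# IUPAC nucleotide codes and the bases each one may stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def site_value(base, ref_base):
    """0 for identity, 1 for a mutation, a fraction for an ambiguity code."""
    base = base.upper().replace("U", "T")
    ref_base = ref_base.upper().replace("U", "T")
    if base == "-" or ref_base == "-":
        return 0.0  # gaps are not counted as polymorphisms
    options = IUPAC.get(base, "ACGT")  # unknown symbols are treated like N
    # Assumption: fraction of candidate bases that differ from the reference.
    return sum(1 for b in options if b != ref_base) / len(options)


def encode(reference, sequences):
    """Return (matrix, polymorphic_sites) for sequences aligned to reference.

    matrix has one row per sequence and one column per polymorphic site,
    i.e. per site where at least one sequence differs from the reference.
    """
    rows = [[site_value(b, r) for b, r in zip(seq, reference)]
            for seq in sequences]
    polymorphic = [j for j in range(len(reference))
                   if any(row[j] > 0 for row in rows)]
    matrix = [[row[j] for j in polymorphic] for row in rows]
    return matrix, polymorphic


# Toy alignment: three sequences of length 6 against a reference.
matrix, sites = encode("ACGT-A", ["ACGTAA", "ATGT-R", "ACGA-A"])
```

In this toy alignment, the gapped column is dropped and the ambiguous base R (A or G) facing a reference A contributes 0.5.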
We perform a Principal Component Analysis (PCA) in order to determine the most significant coordinates in this P-dimensional space. We then keep the principal components that capture 85% of the total variance, discard the remaining ones and project the data onto this subset of coordinates. This procedure is followed by consensus K-means clustering. Namely, if one aims for K clusters, one repeats the K-means clustering procedure N times and forms the matrix n whose elements nij (i,j=1,…,S) represent the number of times, out of the N trials, that the i-th and j-th sequences were clustered together. In our analysis we set N≥100. The matrix of the distances between the samples is: dij = 1 − nij/N. One then performs standard hierarchical clustering with this distance matrix, again aiming for K clusters. This procedure does not depend on any assumptions made by the phylogenetic models. Note that these techniques can be used for inferring phylogenies as well [23], though this is beyond the scope of the present note.
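The pipeline above can be illustrated with a short NumPy sketch. It is not the implementation used for the analysis, and several choices are assumptions made here for illustration: the distance between samples is taken as 1 − nij/N, the hierarchical step uses average linkage, K-means is a plain Lloyd iteration with random initial centres, and all function names are hypothetical.

```python
import numpy as np


def pca_project(X, variance_kept=0.85):
    """Project rows of X onto the leading principal components that
    together capture at least variance_kept of the total variance."""
    Xc = X - X.mean(axis=0)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    k = int(np.searchsorted(np.cumsum(explained), variance_kept)) + 1
    return Xc @ Vt[:k].T


def kmeans_labels(Y, k, rng, n_iter=50):
    """Plain Lloyd K-means with random initial centres; returns labels."""
    centers = Y[rng.choice(len(Y), size=k, replace=False)]
    for _ in range(n_iter):
        dist = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):  # leave an empty cluster's centre as is
                centers[c] = Y[labels == c].mean(axis=0)
    return labels


def consensus_distance(Y, k, n_trials=100, seed=0):
    """Repeat K-means n_trials times and count co-clustering events.
    Assumed distance: 1 - n_ij / N."""
    rng = np.random.default_rng(seed)
    n = np.zeros((len(Y), len(Y)))
    for _ in range(n_trials):
        labels = kmeans_labels(Y, k, rng)
        n += labels[:, None] == labels[None, :]
    return 1.0 - n / n_trials


def average_linkage(D, k):
    """Agglomerative clustering on distance matrix D down to k clusters
    (average linkage is an assumption; the text only says 'standard')."""
    clusters = [[i] for i in range(len(D))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = D[np.ix_(clusters[a], clusters[b])].mean()
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = clusters.pop(b)  # b > a, so index a is unaffected
        clusters[a].extend(merged)
    labels = np.empty(len(D), dtype=int)
    for c, members in enumerate(clusters):
        labels[members] = c
    return labels


# Toy S x P matrix: two groups of three sequences, eight polymorphic sites.
X = np.array([[1, 1, 1, 1, 0, 0, 0, 0],
              [1, 1, 1, 0, 0, 0, 0, 0],
              [1, 1, 0, 1, 0, 0, 0, 0],
              [0, 0, 0, 0, 1, 1, 1, 1],
              [0, 0, 0, 0, 1, 1, 1, 0],
              [0, 0, 0, 0, 1, 1, 0, 1]], dtype=float)
Y = pca_project(X)
D = consensus_distance(Y, k=2, n_trials=100)
labels = average_linkage(D, 2)
```

Because the consensus matrix averages over many K-means runs, an occasional poor initialisation has little effect on the final hierarchical grouping, which is the motivation for the consensus step.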

Results
Sequence comparison of the available sequences of the new A(H1N1) virus (as of 27 May 2009) did not identify significant sequence variation, except for a few point mutations. Hence A/California/04/2009(H1N1) was chosen as the representative for further analyses. There are many different phylogenetic techniques, each with its own assumptions about evolutionary models, which differ in how genetic distances, probabilities, etc. are computed. Unlike phylogenetic techniques, cluster methods do not require the evaluation of a tree, which is a more complicated structure than a set of clusters. Clustering techniques do not provide a detailed phylogenetic structure because they analyse group features of the sequence data. That is why clustering analysis is more robust to the assumptions we make, for instance the choice of genetic distance. Unsupervised methods provide a way of identifying clusters without relying on previous information about the origin, host and time of isolation. The results are summarised in the Table. Our analyses support the hypothesis that the 2009 pandemic influenza A(H1N1) virus derives from one or multiple reassortment(s) between influenza A viruses circulating in swine in Eurasia and in North America. This is schematically illustrated in Figure 1.

Supplementary Material
Refer to Web version on PubMed Central for supplementary material.

Figure 1. Origins of the new influenza A(H1N1) virus
Schematic representation of the main results of the cluster analysis. The analysis shows that the recent A(H1N1) virus is a reassortment of at least two swine influenza viruses from North America (in light blue) and Eurasia (in dark blue).