Establishment of the European meningococcal strain collection genome library (EMSC-GL) for the 2011 to 2012 epidemiological year

Invasive meningococcal disease surveillance in Europe combines isolate characterisation and epidemiological data to support public health intervention. A representative European Meningococcal Strain Collection (EMSC) of IMD isolates was obtained, and whole genome sequenced to characterise 799 EMSC isolates from the epidemiological year July 2011–June 2012. To establish a genome library (GL), the isolate information was deposited in the pubMLST.org/neisseria database. Genomes were curated and annotated at 2,429 meningococcal loci, including those defining clonal complex, capsule, antigens, and antimicrobial resistance. Most genomes contained genes encoding B (n = 525; 65.7%) or C (n = 163; 20.4%) capsules; isolates were genetically highly diverse, with >20 genomic lineages, five of which comprising 60.7% (n = 485) of isolates. There were >350 antigenic fine-types: 307 were present once, the most frequent (P1.7-2,4:F5-1) comprised 8% (n = 64) of isolates. Each genome was characterised for Bexsero Antigen Sequence Typing (BAST): 25.5% (n = 204) of isolates contained alleles encoding the fHbp and/or the PorA VR1 vaccine component, but most genomes (n = 513; 64.2%) did not contain the NadA component. EMSC-GL will support an integrated surveillance of disease-associated genotypes in Europe, enabling the monitoring of hyperinvasive lineages, outbreak identification, and supporting vaccine programme implementation.


Introduction
Invasive meningococcal disease (IMD) is a contagious, occasionally epidemic, infectious disease that causes significant morbidity and mortality with varying incidence globally [1]. Only a small proportion of individuals infected with Neisseria meningitidis, the meningococcus, develop severe disease, usually meningitis and/or septicaemia, with survivors often suffering permanent sequela [2]. While vaccination with protein-conjugated polysaccharides of serogroup A, C, W, and Y is an effective prevention strategy for the main disease-causing serogroups, currently there is no comprehensive vaccine against meningococcal disease. The structural similarity of the serogroup B polysaccharide capsule to the human neural cell adhesion molecule renders it poorly immunogenic and carries concerns over possible autoimmune reactions [3,4]. To circumvent this problem, vaccines developed to control serogroup B, the main cause of IMD in Europe, have been based on outer-membrane vesicles and more recently on recombinant proteins [5,6].
Understanding the impact of meningococcal vaccines requires surveillance that includes knowledge of disease-associated meningococci and the antigens and capsule types that they possess. However, comprehensive and timely surveillance of IMD is complicated by the high genetic and antigenic diversity of meningococcal populations [7]. While most meningococcal disease is caused by a minority of meningococcal capsule types, there are many variants of the subcapular antigens that are used in vaccines [8]. However, relatively few meningococcal genotypes, known as hyperinvasive lineages which have been identified by multilocus sequence typing (MLST) as particular clonal complexes (ccs), cause most IMD and these have been shown to be associated with particular variants of antigens and combination of vaccine antigen variants [7,9]. Conventional isolate typing has required multiple tests involving sequencing of multiple genes [10] but whole genome sequencing (WGS) provides the potential to identify all of the salient features in a single experiment [11].
The establishment of a European meningococcal genome library has potential to enhance IMD surveillance continent-wide, providing comprehensive and comparable long-term typing data. To do this, all national reference laboratories for meningococci in Europe were asked to make available in 2014 isolates from one epidemiological year (2011-2012) to contribute to the European Meningococcal Strain Collection (EMSC) in order to provide a bank of isolate genomes representative of those causing invasive disease in Europe at that time. The EMSC Genome Library  (EMSC-GL) was created to exploit whole genome sequence (WGS) data for surveillance, characterisation, and vaccine intervention studies to analyse longitudinal data using both historical, current and future datasets. The WGS data are hosted on the pubMLST. org/neisseria database, where the analysis tools of the BIGSdb platform can be used [12].

Specimen collection
The

Culture and DNA extraction
Isolates were grown on blood agar plates overnight at 37 °C in an atmosphere of 5% CO 2 and DNA extracted using an epMotion 5075 pipetting robot (Eppendorf) and NucleoSpin 8 Tissue kit (Macherey-Nagel) according to the manufacturer's instructions. DNA was visualised with the FlashGel DNA system (Lonza), its quantity assessed using Qubit (Invitrogen, ThermoFisher Scientific Inc), and sent frozen in 96-well plates to the Oxford Genomics Centre (OGC), University of Oxford, United Kingdom (UK).

Whole genome sequencing, de novo assembly, annotation and curation
The DNA samples were normalised and sequenced at the OGC. Quantification and integrity of DNA extracts were assessed using the Invitrogen PicoGreen assay on a 1% E-GelEX. DNA was fragmented using acoustic shearing and the fragment distribution determined using the Agilent Tapestation D1200 system. NEBNext DNA Sample Prep Master (New England BioLabs) was used with minor modifications to construct the libraries. Ligation of the adapters was completed using  Isolate records were created in the pubMLST.org/neisseria database and included provenance and phenotypic typing information [12]. Short-read sequence data were assembled de novo as described previously [13], and, uploaded to the database and then linked to their respective isolate record. Genome assembly statistics were generated as part of this automated pipeline. The data are accessible on pubMLST.org/neisseria web page.
Each genome assembly was queried against the pubMLST.org/neisseria sequence definition database to identify the loci within individual genomes. As identified, loci were automatically annotated with a Neisseria (NEIS) locus identifier and an allele number assigned if an identical reference sequence was present in the database. Novel alleles with ≥ 98% sequence identity to a reference sequence in the database, were automatically assigned new reference numbers. This process was repeated until all NEIS loci had been curated and as many alleles as possible had been defined and assigned in each genome assembly. The new alleles with < 98% sequence identity were manually checked and curated before assigning a new reference number. The annotated loci included those defining antigens, genes known to effect antibiotic sensitivity, the capsule operon genes, and the 1,605 loci belonging to the N. meningitidis core genome MLST (cgMLST) scheme (version 1.0) [9,13,14]. Alleles with a known antibiotic sensitivity range were previously submitted to the pubMLST.org/neisseria sequence definition database and EMSC-GL alleles with the same sequence mutation pattern were inferred to convey the same antibiotic sensitivity range [15][16][17].

Genomic analysis and population structure
Using the 'sequence bin breakdown' tool, the assembly statistics and genome coverage were assessed for each genome assembly. The BIGSdb Genome Comparator Tool was used to generate allele-based pairwise comparison distance matrices that were visualised as Neighbour-Net phylogenies. Cramer's V coefficient was used to measure the association of clonal complex (cc) with Bexsero Antigen Sequence Typing (BAST) and individual vaccine antigens, using the 'cramersV' function in the 'lsr' package in R 3.3.0 [18]. The coefficient ranges from zero to one, with zero indicating no association and one indicating complete association.

Curatorial support and database operation
The data confidentiality and security: no personal identification is recorded in the database and isolates are identified by an anonymised sample name and an automatically assigned isolate accession number. Each reference laboratory is provided immediate access to their submitted data once it has been received and assessed by a curator. The pubMLST database and its curators are currently supported by The Wellcome Trust (grant 104992) and the UK Department of Health (grant H2R00080) respectively.

Assembly and genetic coverage
The isolates (n = 799) represented IMD cases from 16 EU countries during the epidemiological year July 2011-June 2012. The genome assemblies contained an average of 466 contigs, with an average length of 5,211 nt per contig, and an average genome length of 2,143,632 nt. The larger number of contigs and unassembled genes compared to other studies was due to fragmentation of the DNA during extraction process [13,19]. The pubMLST.org/neisseria database (accessed 06/2016) contained 2,429 curated loci, of which 1,605 genes (66.1%) are considered to be a part of the core genome (cgMLST, v1.0) [13]. The curated loci are composed of coding sequences, typing fragments, antigenic peptides, intergenic promotor regions, and other sequence regions of interest. Each draft genome contained between 1,724 and 1,933 identifiable genes with an average of 1,831 per genome. The average genetic content for a N. meningitidis genome, based on finished genomes (NZ-05/33, FAM18, MC58, H44/76, Z2491, F2136 and M01-240196), was 1,967 genes.
The isolate without serological typing information contained capsule type B genes. Two genomes had a poorly assembled region A and the gene that determines capsule type was incompletely assembled; however, the assembled sequence indicated that genetically one isolate was B and one isolate was C.
Of the serogroup records that did not match capsule genotype: three isolates assigned to serogroup C were capsule type B (n = 2) and capsule type Y (n = 1); three serogroup Y were capsule type B; seven isolates assigned to serogroup B were capsule type C (n = 5), capsule type E (n = 1), and one isolate did not contain the capsular regions A or C and was confirmed to be a capsule null (cnl); and one typed as serogroup W was genetically a capsule type Y.
The capsule types determined for the NG isolates included: five capsule type B; two capsule type W; two capsule type X; one capsule type E; two cnl; one capsule Y; and one with a serine residue at codon 310 of the polymerase gene, which would result in a mixed sialic acid capsule phenotype for W and Y [20]. Two of the NG isolates contained capsule B genes but were genetically phase variable off.
The isolate with a mosaic capsule contained capsule biosynthesis genes from serogroup B in addition to a partial region A for capsule type E. There was no report of secondary capsule expression nor inconclusive serogroup results reported for the isolate.

Clonal complex
MLST was determined from the WGS data. For 402/799 isolates (50.3%) MLST data obtained by conventional means were also available. The WGS data confirmed the sequence type (ST) and cc for 377/402 (93.8%) of these isolates and determined the ST and cc for 395 of the remaining 397 (99.5%) isolates.
Concerning the 25 isolates among the 402, where the conventional MLST and the WGS data did not concur: (i) for two the gdh MLST locus was incomplete but the remaining six MLST alleles were identified: one isolate had a profile consistent with the ST-269cc, but was nonetheless not assigned to any specific ST, while the other was unique and did not match any currently defined ST; (ii) and for 23 isolates, the WGS-derived MLST differed from that previously reported. Most (17/23) differed at a single MLST locus (including 10 genomes differed at pgm, 4 at fumC, 2 at abcZ, and 1 at the aroE locus), three differed at two loci (pgm and abcZ or fumC or gdh, respectively), and three isolates differed at all seven loci. The isolates where the ST was one or two loci different were still assigned to the same cc, the three isolates with seven loci differences were each reassigned to a different cc. These results were consistent with the downstream effects of splitting  samples in seven-locus PCR-based MLST and mislabelling of samples [13].
The meningococcal genomes were characterised with the BAST scheme. The 4CMenB (Bexsero) vaccine contains four different peptide components and is represented by the peptide profile BAST-1 (fHbp 1, NHBA 2, NadA 8, PorA VR1 7-2, PorA VR2 4) [9]. There were no occurrences of BAST-1 in the EMSC-GL. BAST antigens were diverse and 687 genomes (86.0%) were assigned to one of 332 BASTs. The genomes without a BAST  assigned (n = 112; 14.0%) had an incomplete sequence assembly in the region containing the NEIS2109 (nhba) or the NEIS1969 (nadA) gene. The most common 12 BASTs were present at a frequency of at least 1.0% (n = 8) and an additional 32 BASTs were represented by at least three to seven genomes each (Figure 2

Antimicrobial resistance
The pubMLST.org/neisseria database contained some phenotypic antimicrobial data in the form of minimum inhibitory concentration (MIC) ranges. These data were limited to a subset of isolates used for targeted gene sequencing to define susceptibility for ciprofloxacin (NEIS1320, gyrA gene), penicillin (NEIS1753, penA gene), and rifampicin (NEIS0123, rpoB gene) [15,16]. Alleles without MIC data were compared with those with a MIC value to determine if the allele sequence contained known mutations associated with a phenotypic MIC range (Table 3).
Forty-one rpoB alleles were present in the dataset, 23 of which were associated with a susceptible MIC (≤ 1µg/ mL). Most genomes (n = 501; 62.7%) contained one of four rpoB alleles: 18, 2, 4, or 34, and an additional eight alleles were present in at least 1% of the genomes (n = 256; 32.0%). The next most prevalent, allele 184 (n = 5; 0.63%), had no MIC associated testing data and was found in Germany (n = 3) and Poland (n = 2). There were 18 additional alleles, each present in the dataset once, with no MIC associated testing data.
There were 141 genomes (17.7%) with an incompletely assembled penA gene and no allele assignment and 44 genomes (5.5%) did not have the penA gene. There were 67 penA alleles in the dataset, 13 of which had MIC values that were generally sensitive, ≤ 0.06µg/mL (two alleles were always sensitive and 11 alleles were sensitive but had at least one isolate record where the MIC value was intermediate, > 0.06 to 1µg/mL). Thirtyeight penA alleles were associated with MIC values of intermediate sensitivity (> 0.06 to 1µg/mL). There were 16 additional alleles with no MIC associated testing data, 15 of which were found once in the dataset and one allele was found eight times, from Germany (n = 4), Poland (n = 3) and Hungary (n = 1).
There were a total of 19 gyrA alleles and 642 genomes (80.4%) contained allele 4, 2, or 3. Eight genomes (1.0%) had an incompletely assembled gyrA gene. Five additional alleles were found in at least 1% of the isolate genomes and represented an additional 16.5% of the genomes (n = 132). All eight alleles had an associated susceptible MIC test result (≤ 0.03µg/mL) and half of the alleles had at least one isolate record where the MIC result was intermediate (> 0.03 but < 0.12µg/mL). Only three of the eleven remaining alleles had MIC associated data and sequence comparison of the eight alleles with known MIC data was not discriminatory.
Additional groups were evident, including a cluster of genomes from Poland (n = 7), Germany (n = 7), and Hungary (n = 2), containing eight STs not associated with a cc. Members of this new lineage, while sharing up to five of seven MLST alleles with members of cc174 (lineage 14), were distinct using cgMLST (data not shown). A number of lineages formed larger groups or clades including: lineages 11 and 8; and lineages 2 and 35 with lineage 5. Lineage 11 and lineage 23 contained distinct sub-lineages, (11.1 and 11.2; 23.1 and 23.2) as previously described [19]. The vaccine antigen variants and combinations were associated with particular ccs and lineages, with Cramer's V values for cc association ranging from 0.458 (NadA) to 0.703 (NHBA) and 0.94 (BAST).

Lessons learnt
Neisseria genomes contain repeat regions, homopolymeric tracts, and paralogous loci [22,23]; all of which may affect sequencing fidelity, assembly synteny, and genome annotation. In addition, whole chromosome extraction and library preparation methods will have downstream effects on the quality of the assembled genome data. A pilot run comprising a minimum of 96 samples is therefore recommended so that the impact of such issues can be ameliorated before protocols are implemented on larger studies, which can be costly. The curation of genomic data within the pubMLST.org/ neisseria database was developed using a population annotation approach which places allelic diversity at the centre of analysis [12,13], and therefore required a sequence definition database of proportional size and variability to the organism's genetic variation. The allelic annotation against a locus specific definition, where accurate start and stop codons were described while capturing the population variability in a meaningful and usable way, requires dedicated manual oversight of curation. Sustaining the computational and comparative value of a genome library will require regular manual checks, data cleaning, and periodic locus definition updates over the life of the database.

Discussion
The sporadic nature of IMD, combined with the unpredictable occurrence of epidemics and the diversity of meningococci, and especially of meningococcal antigens, makes accurate surveillance an essential element of effective public health measures [24]. The well-documented propensity of particular hyperinvasive lineages to cause IMD outbreaks across continents makes international surveillance of particular importance [7]. Within Europe, there has been coordinated surveillance of IMD since the early 1990s [25]. During this time, meningococcal isolate characterisation was greatly improved by the introduction of molecular methods [26]. The latest enhancement, the use of WGS data was first deployed for routine and comprehensive IMD surveillance on a national scale in the UK, with the establishment of the Meningitis Research Foundation Meningococcus Genome Library (MRF-MGL) from 2010 [19]. The WGS approach provides high-resolution indexing of bacterial diversity potentially replacing phenotypic methods and enabling multiple features of a clinical isolate to be characterised in a single test [13]. With WGS technology becoming increasingly accessible and affordable the EMSC-GL becomes a practical tool for IMD surveillance, complementing and enhancing the information collected by the European Surveillance System (TESSy) database at ECDC and contributing to studies of population biology, evolution, and the consequences of vaccine implementation [27,28].
To maximise international public health impact, the EMSC-GL is hosted within the pubMLST.org/neisseria web-accessible database, which also hosts the MRF-MGL, and allows members to submit genome data and the isolates associated provenance information. This provides publicly-accessible sequence data, described with a uniform nomenclature and linked to relevant provenance and phenotype information. As of early 2017, the continually-expanding pubMLST.org/neisseria database comprised over 2,400 Neisseria loci, including 38 sequence-based typing loci such as antigen genes, seven locus MLST, and cgMLST loci. This permits the rapid genetic characterisation and comparison of > 93% of the content of future meningococci genomes added to the genome library with minimal additional curation. Within pubMLST.org/neisseria these loci are grouped into different schemes: currently 40 such curated schemes are publically available, including those for metabolic pathways, plasmid and phage components, typing profiles, and genomic regions [13]. The WGS approach provides a number of practical advantages in addition to its inherent high-resolution and comparability of nomenclature [13,29]. It is a single test that eliminates the multiple testing of a single specimen, which can introduce errors [30]. For example, the incorrect grouping of alleles into a conventional MLST profile can affect the investigation of an outbreak of invasive meningococcal disease [31]. While errors can occur during de novo WGS assembly, cgMLST analysis ameliorates these problems as, in a high-quality draft genome, such problems affect a minority of loci which are unlikely to materially alter an analysis [13].
Given its importance in invasion and as a vaccine antigen the meningococcal capsule, which defines the serogroup, is one of the most important isolate characteristics required for disease surveillance. The capsule encoding genome regions are analysable by WGS, resolving anomalies that can arise from phenotypic data and enabling the characterisation of meningococci that do not express capsule. The EMSC-GL contained a small number of isolates with indeterminate W/Y expression and isolates containing genes of more than one capsule type [19]. The difficulty in determining the phenotype of phenotypically W/Y isolates was the result of a rare serine residue at codon position 310 of the NEIS2162 (csw) and NEIS2163 (csy) genes and, while there is no evidence of multiple capsule expression, horizontal genetic transfer of region A capsule genes has been described for other isolates [14,32]. The presence of different capsule genes highlights recombination potential among closely related meningococci, which can lead to novel phenotypes of clinical and epidemiological significance [14]. Finally, although the production of a capsule is considered to be a virulence factor, there have been reports of un-encapsulated meningococci, one of which was present in the EMSC genomes, causing IMD [33,34].
Analysis of cgMLST data defined 13 lineages, had good ccs concordance as defined by conventional MLST, and identified a new lineage. Vaccine antigens were also rapidly extracted, including those used to assign BAST types, demonstrating a strong association between vaccine antigen variants and cc among the EMSC-GL. This association was consistent with previous studies in other countries such as Ireland, the Netherlands and the UK, and is important when considering the likely impact of protein-based vaccines because prevalent lineages change over time likely causing changes in disease incidence [9,19,35].
The deposition of future meningococcal genome data in the EMSC-GL will permit the resolution of the epidemiology of IMD at the whole genome across the entire continent, and as long as the European Union (EU) maintains funding the infrastructure costs are minimal. The inclusion of additional isolates in close to real time will enable the rapid identification of outbreaks. The past 40 years have seen the spread of successive hyperinvasive meningococci throughout Europe and, while disease levels remain low in most European countries most of the time, epidemics, hyperendemics, and localised disease outbreaks occur regularly across the continent, including transnational outbreaks [19,26,[36][37][38]. The advantage of WGS data is that it comprehensively characterises isolates for their genetic lineage, vaccine antigens, antimicrobial sensitivity, and other properties rapidly and effectively and is able to resolve outbreaks to a high level of discrimination [13,31]. These advantages greatly outweigh perceived high costs of the WGS approach. Future addition of genomes to the library will enable monitoring of secular changes in circulating meningococci over time and an assessment of the possible effects of the introduction of vaccines, such as the protein-based 4CMenB (Bexsero) and LP2086 (Trumenba) vaccines on IMD epidemiology [9]. Effective meningococcal vaccine development and implementation depends on an understanding of the population biology and the effects of herd immunity in an endemic setting, which in turn relies on high-quality surveillance data and isolate characterisation [39].

Conclusion
In conclusion, the detailed meningococcal isolate characterisation available thought the EMSC-GL will be essential for European IMD surveillance, leading to the improvement of control strategies and disease prevention across the continent. This model is generalisable to other bacterial pathogens, many examples can be found on the pubMLST.org website (https://pubmlst. org/databases/).

*Erratum
In Table 3, the column headings were misaligned. This was corrected on 31 May 2018.