Automated extraction of typing information for bacterial pathogens from whole genome sequence data: Neisseria meningitidis as an exemplar.

Whole genome sequence (WGS) data are increasingly used to characterise bacterial pathogens. These data provide detailed information on the genotypes and likely phenotypes of aetiological agents, enabling the relationships of samples from potential disease outbreaks to be established precisely. However, the generation of increasing quantities of sequence data does not, in itself, resolve the problems that many microbiological typing methods have addressed over the last 100 years or so; indeed, providing large volumes of unstructured data can confuse rather than resolve these issues. Here we review the nascent field of storage of WGS data for clinical application and show how curated sequence-based typing schemes on websites have generated an infrastructure that can exploit WGS for bacterial typing efficiently. We review the tools that have been implemented within the PubMLST website to extract clinically useful, strain-characterisation information that can be provided to physicians and public health professionals in a timely, concise and understandable way. These data can be used to inform medical decisions such as how to treat a patient, whether to instigate public health action, and what action might be appropriate. The information is compatible both with previous sequence-based typing data and also with data obtained in the absence of WGS, providing a flexible infrastructure for WGS-based clinical microbiology.


Introduction
The application of whole genome sequencing (WGS) technology to clinical microbiology has been described as revolutionary: the opportunities are certainly immense, but so too are the challenges of implementing this technology effectively [1]. Above all, clinical microbiology and epidemiology are pragmatic sciences, which require accurate and understandable information to be delivered to those who need to make medical judgements in real time. Often these judgements have to be made in the absence of complete information, and it is essential that widely understood, accepted and reproducible typing methods are employed to guide these decisions [2]. Just as the advent of molecular techniques challenged phenotypic methodologies over a decade ago, replacing imperfect but at least widely accepted techniques with a plethora of non-standardised alternatives [3], the high volumes of sequence data have to be carefully managed if they are to provide enlightenment rather than confusion. The multilocus sequence typing (MLST) paradigm was established in 1998 [4], a time when molecular techniques were beginning to be widely used in the clinical laboratory, but when there was no universally agreed way forward [5]. It was intended as a standardised, reproducible and portable approach that could replace and enhance previous methods, particularly multilocus enzyme electrophoresis (MLEE) [6]. MLST was the first sequence-based approach to the genome-wide characterisation of bacterial isolates to be widely adopted, and automated methods for performing the reactions and extracting the sequence information have subsequently been developed [7][8][9]. At the time MLST was introduced, it was impractical to sequence the whole genomes of very large numbers of isolates, and early analyses showed that in many cases this was not required.
The first MLST scheme, for example, was designed to identify major clones within populations of Neisseria meningitidis, the meningococcus, and was able to do this reliably and reproducibly with just seven gene fragments, totalling only 3,284 bp or about 0.15% of the whole genome [10,11]. Similar numbers and sizes of loci have been successful for MLST schemes covering a wide range of organisms, which is an indication of the high degree of structuring present in many bacterial populations. For many bacteria, including the meningococcus, the extent of genetic diversity present even in this small number of genes under stabilising selection is extensive [12]: as of November 2012, each of the gene fragments used as meningococcal MLST loci had between 424 and 675 distinct alleles recorded on the PubMLST Neisseria website [13], with 54-94% (mean: 71%) of sites polymorphic. Furthermore, in the representative abcZ locus, all four bases were present at a given site over the known population in 54/433 (12%) of the nucleotide positions (Figure 1). Much of this variation is at low frequency and transitory, but which variants fall into this category cannot be known without exhaustive, or at least extensive, sampling over time.
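The per-site statistics quoted above (proportion of polymorphic sites, and sites at which all four bases occur) can be illustrated with a minimal sketch. It assumes that the alleles of a locus are already aligned to equal length; the function name `site_diversity` and the toy sequences are hypothetical and are not taken from PubMLST. The final line simply reproduces the arithmetic behind the "about 0.15%" figure, using the genome size of about 2.2 Mbp.

```python
from typing import Dict, List


def site_diversity(alleles: List[str]) -> Dict[str, int]:
    """Count variable sites across equal-length, aligned allele sequences."""
    if not alleles:
        raise ValueError("at least one allele is required")
    length = len(alleles[0])
    if any(len(a) != length for a in alleles):
        raise ValueError("alleles must be aligned to the same length")

    polymorphic = 0   # sites with more than one base across all alleles
    all_four = 0      # sites at which A, C, G and T are all observed
    for column in zip(*alleles):
        bases = set(column) & set("ACGT")  # ignore gaps and ambiguity codes
        if len(bases) > 1:
            polymorphic += 1
        if len(bases) == 4:
            all_four += 1
    return {"sites": length, "polymorphic": polymorphic, "all_four_bases": all_four}


# Toy data only: four invented 6 bp "alleles" of one locus.
toy_alleles = ["ACGTAC", "ACGTAT", "ATGTAG", "AGGTAA"]
summary = site_diversity(toy_alleles)
print(summary)

# Share of a ~2.2 Mbp genome covered by the 3,284 bp of MLST fragments.
mlst_fraction = 3284 / 2_200_000
print(f"{mlst_fraction:.2%}")
```

Real allele sets that differ in length would first need a multiple alignment, which this sketch deliberately leaves out.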
The MLST approach catalogues this extreme diversity, which is seen in many microbial populations and which remains only partially explored, by the maintenance of curated libraries of allele sequences for each MLST locus. Each unique sequence (allele) is assigned a unique arbitrary number, effectively compressing 400-600 bp of information into a single integer. Further organisation and compression of genetic variation is attained by combining the data from all MLST loci into allelic profiles or sequence types (STs), which are also assigned arbitrary numeric designations, each of which defines a unique string of several thousand nucleotides [12]. This approach has proved to be both efficient and effective: as of November 2012, there were 9,927 STs in the Neisseria MLST database, for example, each precisely characterising a particular seven-locus Neisseria genotype. Similar levels of diversity have been observed in other bacteria hosted at PubMLST and on other MLST repositories [14]. The fact that nearly 10,000 distinct variants of only 3,284 bp of coding sequence under stabilising selection are known to exist in one human-associated bacterium with a genome of about 2.2 Mbp indicates the scale of the cataloguing problem facing us in the era of genomic microbiology.
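The two levels of compression described above (sequence to allele number, then allelic profile to ST) can be sketched as follows. This is an illustration only: the class names are hypothetical, the scheme has two loci rather than seven, and numbers are handed out automatically on first sight, whereas in a curated library new alleles and STs are assigned through curation.

```python
from typing import Dict, Iterable, Tuple


class AlleleLibrary:
    """Toy allele library: one arbitrary integer per unique sequence per locus."""

    def __init__(self) -> None:
        self._ids: Dict[str, Dict[str, int]] = {}

    def assign(self, locus: str, sequence: str) -> int:
        known = self._ids.setdefault(locus, {})
        seq = sequence.upper()
        if seq not in known:
            known[seq] = len(known) + 1  # next unused number for this locus
        return known[seq]


class ProfileLibrary:
    """Toy profile library: one arbitrary integer (ST) per unique allelic profile."""

    def __init__(self, loci: Iterable[str]) -> None:
        self.loci: Tuple[str, ...] = tuple(loci)
        self._sts: Dict[Tuple[int, ...], int] = {}

    def assign(self, alleles: Dict[str, int]) -> int:
        profile = tuple(alleles[locus] for locus in self.loci)
        if profile not in self._sts:
            self._sts[profile] = len(self._sts) + 1
        return self._sts[profile]


# Two-locus toy scheme with invented sequences.
allele_lib = AlleleLibrary()
profile_lib = ProfileLibrary(["abcZ", "adk"])

isolates = {
    "iso1": {"abcZ": "ACGTAC", "adk": "GGCATT"},
    "iso2": {"abcZ": "ACGTAC", "adk": "GGCATT"},  # identical to iso1
    "iso3": {"abcZ": "ACGTAT", "adk": "GGCATT"},  # differs at abcZ only
}

sequence_types = {}
for name, seqs in isolates.items():
    calls = {locus: allele_lib.assign(locus, seq) for locus, seq in seqs.items()}
    sequence_types[name] = profile_lib.assign(calls)

print(sequence_types)
```

Two isolates then share an ST exactly when they share the same sequence at every locus in the scheme, so comparison reduces to comparing integers.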
Nevertheless, there are instances when even the very high levels of diversity routinely seen in MLST datasets do not provide sufficient information for clinical decision-making. This is because even populations of diverse organisms, such as the meningococcus, are highly structured, with most isolates belonging to clonal complexes of related bacteria, many of which share identical STs [15]. This detection of population structuring is one of the strengths of the MLST approach, as these clusters are frequently associated with phenotypes of clinical interest such as virulence or expression of vaccine antigens [16]. This clustering, however, can mean that isolates with the same ST may not have the same point source, so ST alone is insufficient to unambiguously identify strains belonging to an outbreak. For this reason, additional highly variable antigenic loci are included in the recommended typing scheme for meningococci [17] and for other organisms such as Campylobacter [18] that are regularly typed by MLST. For meningococci, there are also curated sequence-based schemes for genes that encode antimicrobial resistance that provide additional clinically valuable information [19,20]. Other schemes, such as variable-number tandem repeat (VNTR) analysis, also allow high discrimination of isolates in outbreak situations [21,22]. Combining these high-resolution typing approaches with seven-locus MLST and spatial and temporal epidemiology techniques permits the proactive identification of outbreaks of infectious disease [23].
For a small number of bacteria, the so-called single clone pathogens, there is insufficient variation in seven-locus MLST to provide epidemiological resolution, usually because these pathogens have evolved recently from single clones, undergo little recombination and contain too little genetic variation [24]. These include organisms of great medical importance such as Mycobacterium tuberculosis [25], Yersinia pestis [26], Bacillus anthracis [27] and Salmonella enterica var Typhi [28]. For these bacteria, data from the whole genome, often in the form of single nucleotide polymorphisms (SNPs) [29], but also including other types of variation such as VNTRs, are essential for epidemiological purposes. These data will also have to be stored and interpreted in an accessible way that produces output usable by clinical decision-makers and that is both forwards and backwards compatible.
One of the motivations that drove the development of MLST was future-proofing. Even at a time when the costs of sequencing were seen by some as prohibitive [30], nucleotide sequence data had major advantages: they might be added to, but they would never become obsolete, as they represented the fundamental level of genetic information, and they are readily understood, stored, compared and distributed [12]. WGS data are now so inexpensive to obtain that WGS is becoming the fastest and most economical way of obtaining information at multiple loci for determining MLST or other STs [31]. When used in this way, these data are directly comparable to the extensive sequence databases that have been established since the first use of MLST [32,33]. Here we describe how the suite of databases hosted at PubMLST [34] has been updated to accommodate WGS data and describe the tools that are available to rapidly extract typing information from such data. We also describe how these tools can be exploited further to achieve very high resolution from such data when required.

Database structure
As of November 2012, the majority of the typing databases hosted at PubMLST [34] were using the Bacterial Isolate Genome Sequence Database (BIGSdb) platform to archive isolate and sequence diversity data [35]. This software was developed to facilitate the flexible storage and exploitation of the whole range of sequence data that might be available from a clinical specimen, from single Sanger sequencing reads through to whole genomes, which may be either complete or consisting of multiple contiguous sequences ('contigs'), as assembled from data from the current generation of sequencing instruments. The BIGSdb platform consists of two kinds of database: (i) a definition database that contains the sequences of known alleles of loci under study, as well as allelic profiles (combinations of alleles at specific loci) for schemes such as MLST; and (ii) an isolate database that contains isolate provenance and other metadata along with nucleotide sequences associated with that isolate. An isolate database can interact with any number of definition databases and vice versa, allowing networks of authoritative nomenclature servers and partitioning of isolate datasets and projects, with curator access controlled by specific permissions set by an administrator.

Reference databases
The definition databases are central to genome analysis using the gene-by-gene (MLST-like) analysis approach implemented in BIGSdb. By storing all known allelic diversity for any locus of interest, the definition databases provide a centralised queryable repository and a common language for expressing sequence differences, making it a trivial process to identify alleles that are different among isolates and, equally importantly, those that are identical. Because sequence differences are linked directly to a particular locus (which can be any definable sequence string, nucleotide or peptide) and with appropriate grouping of loci into 'schemes' (groups of related loci), the context of this locus is immediately apparent: identifying it, for example, as a member of a conventional MLST scheme, as responsible for antimicrobial resistance, as a participant in a biochemical pathway and so on. As of November 2012, the Neisseria PubMLST definition database had allelic sequences defined for 1,272 loci with 114,469 unique alleles.

Extracting typing information
Web-based and stand-alone tools have been developed that facilitate identification of STs directly from short-read data [36,37]. These methods are, of course, dependent on the sequence and profile definitions made available on PubMLST.org, which also has functionality to extract typing information directly from submitted assembled genomes that are routinely scanned for known alleles. As the locations of these loci are 'tagged' in the sequence data for future reference within BIGSdb, the genome sequences are automatically annotated for those loci for which definition databases exist. The definition database can also be queried using genome data not uploaded to the isolate database to identify a strain directly from sequence data. The BIGSdb platform also has functionality that enables an administrator to define scanning rules and report formatting. This uses a built-in script interpreter so that analysis paths can be taken by following a decision tree defined by the rules. This has been implemented within the PubMLST Neisseria sequence definition database to automatically extract the strain typing information for the meningococcus (ST, clonal complex and antigen sequence type comprising PorA variable regions and FetA variable region) [17,33], along with antibiotic resistance information, from sequence data that are pasted into a web form (Figure 2, panel A). The script instructs the software to first scan the MLST alleles and, if these are all identified, to identify the ST and clonal complex by querying the reference data tables. It then scans the typing antigens and formats the results of these with the MLST results into a standardised strain designation [17]. Following this, the sequences of the penA and rpoB genes are extracted and then compared with isolates with matching sequences within the PubMLST isolate database to determine the most likely penicillin and rifampicin sensitivity.
All of this is displayed in a plain language report (Figure 2, panel B). The whole analysis is extremely rapid, taking about 40 seconds within the web interface.
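The rule-driven analysis path described above can be sketched as a small decision tree. This is not the BIGSdb rule script itself: the function names, the reference tables and every value in them are hypothetical, the designation string is only loosely modelled on the recommended format [17], and the allele scan is assumed to have happened already, so the inputs are allele calls (or `None` where a locus was not identified). Only the penicillin step is shown; rifampicin would follow the same pattern with rpoB.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

# The seven loci of the meningococcal MLST scheme.
MLST_LOCI = ("abcZ", "adk", "aroE", "fumC", "gdh", "pdhC", "pgm")

# Toy reference tables (invented values, not real PubMLST records).
ST_TABLE: Dict[Tuple[int, ...], int] = {(1, 2, 3, 4, 5, 6, 7): 9001}
CC_TABLE: Dict[int, str] = {9001: "cc-demo"}


def likely_phenotype(allele: Optional[int],
                     records: List[Tuple[int, str]]) -> Optional[str]:
    """Most common recorded phenotype among isolates sharing this allele."""
    if allele is None:
        return None
    counts = Counter(pheno for a, pheno in records if a == allele)
    return counts.most_common(1)[0][0] if counts else None


def type_isolate(mlst: Dict[str, Optional[int]],
                 antigens: Dict[str, Optional[str]],
                 pena_allele: Optional[int],
                 pena_records: List[Tuple[int, str]]) -> Dict[str, Optional[str]]:
    report: Dict[str, Optional[str]] = {
        "ST": None, "clonal_complex": None, "designation": None, "penicillin": None,
    }

    # Step 1: look up the ST only if every MLST locus was identified.
    calls = [mlst.get(locus) for locus in MLST_LOCI]
    if all(call is not None for call in calls):
        st = ST_TABLE.get(tuple(calls))
        if st is not None:
            report["ST"] = f"ST-{st}"
            report["clonal_complex"] = CC_TABLE.get(st)

    # Step 2: typing antigens, then assemble the strain designation.
    parts: List[str] = []
    vr1, vr2 = antigens.get("PorA_VR1"), antigens.get("PorA_VR2")
    if vr1 is not None and vr2 is not None:
        parts.append(f"P1.{vr1},{vr2}")
    if antigens.get("FetA_VR") is not None:
        parts.append(f"F{antigens['FetA_VR']}")
    if report["ST"] is not None:
        cc = report["clonal_complex"]
        parts.append(f"{report['ST']} ({cc})" if cc else report["ST"])
    report["designation"] = ": ".join(parts) if parts else None

    # Step 3: infer likely penicillin phenotype from isolates sharing penA.
    report["penicillin"] = likely_phenotype(pena_allele, pena_records)
    return report


full_mlst = dict(zip(MLST_LOCI, (1, 2, 3, 4, 5, 6, 7)))
toy_antigens = {"PorA_VR1": "7", "PorA_VR2": "16", "FetA_VR": "3-3"}
toy_records = [(5, "susceptible"), (5, "susceptible"),
               (5, "reduced susceptibility"), (9, "reduced susceptibility")]

report = type_isolate(full_mlst, toy_antigens, 5, toy_records)
print(report)
```

If any MLST locus is missing, the sketch skips the ST lookup but still reports the antigen results, reflecting the point that partial information should remain usable.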

Comparing genomes
Because genomic diversity is recorded within BIGSdb as allele numbers, WGS analysis is possible using the highly scalable techniques developed for seven-locus MLST. Once loci have been defined and alleles identified, they can be used essentially as a whole-genome MLST scheme, or any chosen subset of predefined loci can be combined to form a scheme. This is the principle behind the Genome Comparator analysis [38], which can use either the defined loci or extract coding sequences from an annotated reference genome to perform comparisons against genomes within the database. Using a reference genome, or set of predefined reference loci, each of the coding sequences is compared against the test genomes using BLAST. Allele sequences that are the same as the reference are designated allele 1, while each unique allele different from the reference is assigned a sequential number. Once each locus has been tested, a distance matrix is generated based on allelic identities between each pair of isolates. This can then be visualised using standard algorithms; the PubMLST website incorporates the Neighbor-net algorithm [39] implemented in SplitsTree4 [40]. Because analysis relies only on using BLAST to compare each locus within a genome in turn, either against the single annotated reference sequence or against all known alleles if using defined loci, the analysis is again very rapid, allowing multiple genomes to be compared within minutes, with analysis time increasing only linearly, not geometrically, with additional genomes.
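The allele-numbering and distance-matrix steps can be sketched as follows. This is a simplified illustration rather than the Genome Comparator implementation: the BLAST step is assumed to have been run already, so each genome is represented by the sequence extracted for each locus (a locus that was not found is simply absent), and the function names are hypothetical. Distances here count differing alleles over the loci that both isolates possess, which is one reasonable convention among several.

```python
from itertools import combinations
from typing import Dict


def number_alleles(reference: Dict[str, str],
                   genomes: Dict[str, Dict[str, str]]) -> Dict[str, Dict[str, int]]:
    """Reference sequence is allele 1; each new distinct sequence gets the next number."""
    calls: Dict[str, Dict[str, int]] = {name: {} for name in genomes}
    for locus, ref_seq in reference.items():
        seen = {ref_seq: 1}
        for name, seqs in genomes.items():
            seq = seqs.get(locus)
            if seq is None:
                continue  # locus not found in this genome
            if seq not in seen:
                seen[seq] = len(seen) + 1
            calls[name][locus] = seen[seq]
    return calls


def distance_matrix(calls: Dict[str, Dict[str, int]]) -> Dict[str, Dict[str, int]]:
    """Pairwise count of loci with different allele numbers (shared loci only)."""
    names = list(calls)
    dist = {a: {b: 0 for b in names} for a in names}
    for a, b in combinations(names, 2):
        shared = calls[a].keys() & calls[b].keys()
        d = sum(calls[a][locus] != calls[b][locus] for locus in shared)
        dist[a][b] = dist[b][a] = d
    return dist


# Toy data: two reference loci and three invented genomes.
reference = {"locus1": "AAA", "locus2": "CCC"}
genomes = {
    "g1": {"locus1": "AAA", "locus2": "CCC"},
    "g2": {"locus1": "AAT", "locus2": "CCC"},
    "g3": {"locus1": "AAG", "locus2": "CCG"},
}

calls = number_alleles(reference, genomes)
dist = distance_matrix(calls)
print(calls)
print(dist)
```

In this sketch the sequence comparison is done once per genome per locus; the pairwise step then operates only on small integers, and the resulting matrix could be passed to a network or tree-drawing program.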
The Genome Comparator approach is generic and any number of loci in any groups can be used for this type of analysis. Many loci have been defined for the meningococcus, including the 53 ribosomal (r) genes that are used as a basis of rMLST [41][42][43][44]. The full complement of ribosomal genes has a number of advantages for indexing variation. These genes are universally present in members of the domain, are protein encoding and therefore generally assemble well from short-read sequences and are distributed throughout the genome. They encode proteins that form part of a coherent, macromolecular structure and contain variation that is informative at a wide range of levels of discrimination. These data can be used within and among members of the same genus, for both species and strain definition [42].

Analysis of whole genome sequence data for meningococci
The Neisseria PubMLST database is continually expanding: as of November 2012, there were 221 isolate records with deposited genome sequence data linked to published studies [11][45][46][47][48][49][50][51]. Of these 221 genomes, 170 were meningococci, with the remainder belonging to other species within the genus [42]. The data consisted of a mixture of finished genomes, multiple contigs generated from de novo assembly, contigs generated by mapping to a reference sequence and sets of predicted coding sequences. These are treated identically by BIGSdb to identify and tag sequences of known loci, and where these loci are members of existing typing schemes, such as MLST or antigen typing, these genomes can be compared to legacy data (Table).
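The idea of scanning assembled sequence for known alleles and tagging their positions can be illustrated with a deliberately simple sketch. BIGSdb uses BLAST for this step; the version below substitutes exact substring matching on both strands purely to keep the example self-contained, so it will miss anything that is not an exact match to a defined allele. The `Tag` structure, the function names and the toy sequences are all hypothetical.

```python
from typing import Dict, List, NamedTuple


class Tag(NamedTuple):
    locus: str
    allele_id: int
    contig: str
    start: int      # 0-based start on the contig as stored
    end: int        # exclusive end
    reverse: bool   # True if the allele lies on the reverse strand


_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def tag_known_alleles(contigs: Dict[str, str],
                      definitions: Dict[str, Dict[int, str]]) -> List[Tag]:
    """Record the first exact match of each known allele in each contig."""
    tags: List[Tag] = []
    for locus, alleles in definitions.items():
        for allele_id, allele_seq in alleles.items():
            for reverse in (False, True):
                query = reverse_complement(allele_seq) if reverse else allele_seq
                for name, contig in contigs.items():
                    pos = contig.find(query)
                    if pos != -1:
                        tags.append(Tag(locus, allele_id, name,
                                        pos, pos + len(query), reverse))
    return tags


# Toy data: the same function handles one contig or many.
contigs = {"contig1": "TTTACGGATTT", "contig2": "GGGTTCCGGG"}
definitions = {
    "locA": {1: "ACGGA", 2: "ACGTA"},
    "locB": {1: "CGGAA"},
}

tags = tag_known_alleles(contigs, definitions)
for tag in tags:
    print(tag)
```

Because the input is just a mapping of names to sequences, a finished genome, a set of contigs or a set of predicted coding sequences can all be passed through the same function.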
Neighbor-net visualisation of distance matrices generated with Genome Comparator from allelic rMLST data [44] provides a highly scalable, rapid and easily understood way of placing isolates within the known diversity of a bacterial species. For example, the interrelationships of 139 N. meningitidis isolates present in the PubMLST Neisseria database [13] can be efficiently represented by this method. Since rMLST alleles are automatically tagged within the database, this analysis is rapid and the Neighbor-net trees can be generated in a few minutes. The rMLST analysis differentiates clonal complexes; however, in addition it provides much higher resolution than conventional seven-locus MLST [38], robustly indicating both relationships among and diversity within clonal complexes (Figure 3).

Conclusions and future prospects
Nucleotide sequences are a universal language that can be interpreted in a number of ways. For clinical and epidemiological purposes, sequences from clinical specimens have to be rapidly and effectively translated into a meaningful term or set of terms that define those properties of the aetiological agents of disease that direct medical and public health action. One of the factors behind the success of seven-locus MLST was the introduction of standard sets of nomenclature that reflected the structure of microbial populations and their phenotypic properties. For organisms with well-established and accepted MLST and other typing schemes in place, the impact of the application of WGS data will be to rapidly identify properties such as strain type. In some cases, novel nomenclature may be required, but this is a process that has to be approached with care, if confusion in the wider clinical community is to be avoided.
The suite of database subsites on PubMLST, which now includes a site that catalogues the ribosomal diversity across the whole domain for the purposes of rMLST typing [44,52], provides an example of how WGS data can be used to efficiently designate specimens to current strain types. It can also be used to establish additional typing schemes that can coexist side by side, as there is no limit to the number of loci and schemes that can be defined. As the database stores the sequence information that is available for an isolate, be that a single read or a whole genome, it is possible to seamlessly compare isolates for which different types of information are available, achieving backwards compatibility with previous typing schemes, as well as compatibility with diagnostic tests that may target only one or a few loci. The extent to which isolates can be compared depends only on the quality of the sequence data available for the locus in question, but given that clinical specimens are often imperfect, it is important for clinical and epidemiological purposes that incomplete or partial information can be used. While many studies place short-read data in a sequence read archive, these data are not easily accessible or readily analysed. PubMLST curators proactively assemble short-read data and incorporate the resultant contigs into the database where metadata are available. Links are made to the sequence read archive within PubMLST isolate records so that original data can be retrieved and analysed when required. While the Neisseria databases described are exemplars, databases for other species can be hosted on request and the open-source BIGSdb software is freely available for local installation.
The first analyses of WGS data on bacterial specimens relied on SNP analysis of closely related bacteria, with mapping of sequence reads to a predefined reference genome. These have required pre-analysis of the samples by an approach such as MLST to limit the extent of variation being analysed [53][54][55][56][57][58]. This approach is also appropriate and can be very effective for 'single clone' pathogens [25][26][27][28]; however, it is not feasible for the general analysis for diagnosis or surveillance of bacteria such as the meningococcus that exhibit more typical levels of sequence diversity. Indeed, the use of the term SNP when discussing bacterial genome variation outside the examples described above is unfortunate and can be misleading. The concept of the 'SNP' has been taken from human medical genomics to microbial genomics: in humans, it is in some cases appropriate to discuss SNPs, when they are associated with a particular genetic disease, but genetic variation in terms of sequence polymorphism is much more complex in bacteria. As seen here, the great majority of microbial populations contain tens of thousands of polymorphisms even within organisms that are closely related, not to mention large amounts of variation due to insertions, deletions and rearrangements, which cannot even remotely be described as 'SNPs'. The term sequence variation is more appropriate as individual polymorphisms, especially in bacteria, are invariably embedded with many other variants into alleles, and it is these alleles (each often with many variable sites) that are associated with particular phenotypes.
Although the typing of bacterial specimens with existing schemes is a valuable contribution of WGS data to clinical microbiology and epidemiology, it is not, of course, the only use for these data. There are many other possible applications for both research and detailed investigation of outbreaks [38]; however, it is important that the analysis of these data is driven by the question that is being asked. If an outbreak can be resolved with a few loci, then there is no need to pursue the data further and certainly no need to report more detail than necessary to a hard-pressed frontline clinician or epidemiologist who, in general, will only require the information necessary to resolve the medical problem at hand. In other cases, resolution of a particular outbreak may require data from the whole genome [53]. For this reason, it will be increasingly necessary to store WGS data from clinical specimens in an understandable form, that is, as assembled sequences, within flexible structures, such as that offered by the PubMLST platform powered by BIGSdb, where WGS information can be hierarchically queried in real time by individuals with limited bioinformatics expertise to generate the data at the resolution required to address their problem. In this context these data will provide an exciting opportunity to extend our understanding of infectious disease caused by bacteria and will enhance our ability to combat it.

Figure 3
Relationships of 139 Neisseria meningitidis genomes in the PubMLST Neisseria database, generated with Genome Comparator and Neighbor-net from allelic profile data for rMLST loci. r: ribosomal; MLST: multilocus sequence typing. The locations of isolates belonging to major clonal complexes identified by conventional MLST are indicated (cc1, etc.). The figure illustrates relationships not apparent from seven-locus MLST, including the diversity of some clonal complexes (e.g. cc1) and the interrelationships of others, e.g. cc8 and cc11 clonal complexes, and the relationships of the ET-15 and ET-37 variants within cc11.