Design and Application of a Core Genome Multilocus Sequence Typing Scheme for Investigation of Legionnaires' Disease Incidents

Citation style for this article: Haar K, Amato-Gauci AJ. European men who have sex with men still at risk of HIV infection despite three decades of prevention efforts. Sequence-based typing (SBT) for Legionella pneu-mophila (Lp) has dramatically improved Legionnaires' disease (LD) cluster investigation. Microbial whole genome sequencing (WGS) is a promising modality for investigation but sequence analysis methods are neither standardised, nor agreed. We sought to develop a WGS-based typing scheme for Lp using de novo assembly and a genome-wide gene-by-gene approach (core genome multilocus sequence typing, cgMLST). We analysed 17 publicly available Lp genomes covering the whole species variation to define a core genome (1,521 gene targets) which was validated using 21 additional published genomes. The genomes of 12 Lp strains implicated in three independent cases of paediatric humidifier-associated LD were subject to cgMLST together with three 'outgroup' strains. cgMLST was able to resolve clustered strains and clearly identify related and unrelated strains. Thus, a cgMLST scheme was readily achievable and provided high-resolution analysis of Lp strains. cgMLST appears to have satisfactory discriminatory power for LD cluster analysis and is advantageous over mapping followed by single nucleotide polymorphism (SNP) calling as it is portable and easier to standardise. cgMLST thus has the potential for becoming a gold standard tool for LD investigation. Humidifiers pose an ongoing risk as vehicles for LD and should be considered in cluster investigation and control efforts.


Introduction
Legionellae are Gram-negative rods found in aqueous environments [1].Humans become infected through exposure to contaminated aerosols originating from man-made water systems, such as spa pools, cooling towers or showering facilities in various settings such as healthcare, public and domestic facilities as well as occupational and travel-related settings.Clinical manifestation varies from a mild illness (Pontiac fever) to potentially fatal pneumonia known as Legionnaires' disease (LD) [2].Among nearly 70 Legionella species described, Legionella pneumophila (Lp) causes the vast majority of LD and of 16 known serogroups, Lp serogroup 1 accounts for over 80% of LD cases and almost all clusters and outbreaks [3,4].
LD is a notifiable disease in all European Union and European Economic Area countries.Surveillance coordinated by the European Legionnaires' Disease Surveillance Network (ELDSNet) of the European Centre for Disease Prevention and Control (ECDC) demonstrates a mild increase in incidence of LD since 2001 [5,6].The standardised sequence-based typing (SBT) scheme for Lp developed by the European Working Group for Legionella Infections (EWGLI, now European Society for Clinical Microbiology Study Group on Legionella Infections, ESGLI) marked an important advancement in the study of the molecular epidemiology of LD [7,8].Implementation of this typing scheme, similar to multilocus sequence typing (MLST) has yielded useful and comparable data worldwide [9] and has been shown to be applicable in the investigation of unusual LD cases such as humidifier-associated LD [10] and legionellosis outbreaks [11].
The advent of next-generation sequencing (NGS) has revolutionised microbiology by making whole genome sequencing (WGS) of pathogens of public health importance, readily available [12].The currently most significant role for NGS in microbiology is communicable disease surveillance and outbreak investigation.Many studies have demonstrated that whole genome comparisons provide far greater resolution for outbreak detection and microbial strain tracking than gold standard typing methods of different bacteria [13][14][15].While most studies using WGS-based molecular epidemiology have relied on mapping of read data against a reference followed by analysis of single nucleotide polymorphisms (SNPs), a de novo assembly and genome-wide gene-by-gene approach looking at allelic differences in core genome (cg) genes (cgMLST, or MLST + as called in the SeqSphere + software used for analysis) has been suggested as an alternative to SNP mapping [16,17].
There are limited data regarding the application of WGS for investigation of LD.Moreover, current experience is limited to analyses of SNPs and thus there is an unmet need for a cgMLST typing scheme for Lp that would enable a portable global nomenclature.Therefore, the goal of the current study was to set up, validate and apply a cgMLST scheme for Lp.

Standard laboratory work up of Legionella pneumophila strains
Isolates were cultured on BCYEα plates (Oxoid, Basingstoke, United Kingdom) for 48-72h at 35 °C before phenotypic and molecular tests were performed.Presumptive identification as Lp was confirmed using MonoFluo Legionella pneumophila indirect fluorescent antibody (IFA) Test (Biorad, Hemel Hempstead, United Kingdom).Lp serogroups and immunological subgroups (for selected serogroup 1 isolates) were determined using the Dresden panel of monoclonal antibodies [18].Strains not readily confirmed as Lp were identified to species level by sequencing the mip gene as described by Ratcliff et al. [19] and comparing the sequence to the mip database [20].
Between two to three single colonies per Lp strain were selected and DNA extracted using the InstaGene Matrix (Biorad, Hemel Hempstead, United Kingdom).The genotype of each strain was determined using the M13 modification of the ESGLI SBT method by Sanger sequencing [21].All alleles and sequence types (ST) were determined using the Legionella Sequence Quality Tool [22,23].SBT was attempted on sputum in culture-negative cases using the direct or nested-SBT approach [21].

Whole genome sequencing and assembly
Whole genome shotgun sequencing was performed on 15 strains recovered from clinical and environmental samples in Israel.High molecular weight and quality DNA was extracted using the Wizard DNA purification kit (Promega, Madison, WI, United States).Sequencing libraries were prepared using the Nextera chemistry (Illumina Inc., San Diego, California, United States) for a 250 bp paired-end sequencing run on an Illumina MiSeq sequencer.Samples were sequenced to aim for a minimum coverage of 75-fold using Illumina's recommended standard protocols.All generated raw reads were submitted to the European Nucleotide Archive (ENA) (http://www.ebi.ac.uk/ena/) of the European Bioinformatics Institute under the study accession number PRJEB7140.After sequencing, the reads were quality-trimmed using the CLC Genomics Workbench software version 6.0 (CLC bio, Aarhus, Denmark) and then assembled de novo using CLC Genomics Workbench with default settings.The resulting assembly files were exported as ACE files and imported into SeqSphere + software version 2.1 (Ridom GmbH, Germany).

Core genome multilocus sequene typing scheme definition and validation
For determining a cgMLST or MLST + target set we aimed to cover the whole Lp species variation.By Bayesian Analysis of Population Structure (BAPS) based on more than 800 SBT STs, Underwood et al. recently reported 15 such Lp BAPS SBT clusters [24].Therefore, we used for the cgMLST scheme definition, six finished genomes available from GenBank and 11 raw read datasets from the ENA archive that cover all BAPS SBT clusters (Table 1).ENA raw read data were again de novo assembled into draft genomes with CLC Genomics Workbench.The genome of strain Philadelphia (NC_002942.5) was used as a reference.To determine the cgMLST target gene set, a genome-wide gene-by-gene comparison was performed using the MLST + target definer function of SeqSphere + with default parameters.These to multi-copy occurrence.e Whole genome sequencing analysis corrected erroneous allelic profile of ST707 compared with original publication [24].
parameters comprised the following filters to exclude certain genes of the Philadelphia reference genome from the MLST + scheme: a 'Minimum length filter' that discards all genes shorter than 50 bp; a 'Start codon filter' that discards all genes that contain no start codon at the beginning of the gene; a 'Stop codon filter' that discards all genes that contain no stop codon, more than one stop codon or if the stop codon is not at the end of the gene; a 'Homologous gene filter' that discards all genes with fragments that occur in multiple copies within a genome (with identity 90% and > 100 bp overlap); and a 'Gene overlap filter' that discards the shorter gene from the MLST + scheme if the affected two genes overlap > 4 bp.The remaining genes were then used in a pairwise comparison using Basic Local Alignment Search Tool (BLAST) version 2.2.12 (parameters used were: 'Word size: 11', 'Mismatch penalty: -1', 'Match reward: 1', 'Gap open costs: 5', and 'Gap extension costs: 2') with the 16 query Lp chromosomes [25].
All genes of the reference genome that were common in all query genomes with a sequence identity ≥ 90% and 100% overlap, and with the default parameter 'Stop codon percentage filter' turned on (this discards all genes that have internal stop codons in more than 20% of the query genomes) formed the final MLST + scheme (downloadable from SeqSphere + software).
To validate the applicability and representativeness of the Lp MLST + target gene set, a total of 21 published high-quality genomes [24,26] -four finished genomes and 17 raw read ENA datasets that were first de novo assembled -representing 12 of the 15 BAPS SBT clusters were chosen for SeqSphere + cgMLST analysis (Table 2) performed as below.It was assumed that a well-defined cgMLST scheme should reach at least 95% of the MLST + genes present in each of the 21 validation genomes.

Core genome multilocus sequence typing analysis of humidifier related cases
To calibrate the cgMLST scheme for micro-evolutionary change, 15 newly generated Lp genomes (Table 3) representing three epidemiologically unrelated humidifier-associated paediatric LD clusters from Israel were analysed together with the finished Philadelphia and Paris strain genomes.
Thus SeqSphere + extracted the defined MLST + core genome genes from each assembly with default parameters, mainly consisting of the following settings: (i) processing options: 'Ignore contigs shorter than 200 bases'; (ii) scanning options: 'Matching scanning thresholds for creating targets from assembled genomes' with 'required identity to reference sequence of 90%' and 'required aligned to reference sequence with 100%'; (iii) BLAST options: 'Word size: 11', 'Mismatch penalty: -1', 'Match reward: 1', 'Gap open costs: 5', and 'Gap extension costs: 2'.In addition, the MLST + scheme genes were assessed for quality, i.e. the absence of premature stop codons, ambiguous nucleotides, and support of variants to reference sequence by 75% or more read nucleotide.
A core genome gene was considered a 'good target' only if all of the above criteria were met, in which case complete sequence was analysed in comparison to the Philadelphia reference and SeqSphere + assigned a numerical allele type.The combination of all core genome alleles in each strain formed an allelic profile per the proposed new scheme.From these allelic profiles a minimum spanning tree was calculated and drawn using SeqSphere + .In order to maintain backwards compatibility with Lp SBT, sequences of the seven genes comprising the allelic profile of the SBT schemes were separately extracted from finished genomes and WGS data and then queried against the SBT database in order to assign classic STs in silico.Finished genomes were from the NCBI database.b Assembled raw reads were from EBI. c Extraction of mompS allele from genomic data not possible due to multi-copy occurrence.d Ordered in accordance with SBT scheme [21]: flaA, pilE, asd, mip, mompS, proA, neuA.e Wrongly stated as LC6677 in Underwood et al. [24].

Setting up and validation of core genome multilocus sequence typing for Legionella pneumophila
Six finished genomes available from GenBank and 11 raw read datasets from ENA that cover all BAPS SBT clusters (Table 1) were used for cgMLST definition.
ENA raw read data were de novo assembled into draft genomes.The Philadelphia strain (NC_002942.5)served as reference for core genome gene definition.The resulting cgMLST scheme consisted of 1,521 genes (ca 47.2% of the complete Philadelphia strain chromosome nucleotide; list of core genes available upon request from the authors).The SBT alleles were extracted from the genomes and generated correct ST designations for nine strains.In eight strains, six of seven alleles were called correctly but the allele number for the mompS gene could not be determined due to presence of more than one copy of mompS in the genome (Table 1).
The cgMLST scheme was validated using 21 additional genomes derived from recent publications (Table 2).All 21 strains showed > 96% good MLST + targets and resulted on average in 98.4% MLST + targets.Of 20 strains for which the ST designation was available, 14 were fully extracted from the WGS data, whereas in the six remaining strains, only six of seven alleles were called correctly due to multiple mompS gene copies (Table 2).

Investigation of humidifier-associated Legionnaires' disease
To calibrate the cgMLST scheme for micro-evolutionary change and to define a cluster type (CT) threshold, 15 newly generated Lp genomes representing three epidemiologically unrelated humidifier-associated paediatric LD clusters from Israel were analysed together with the finished Philadelphia and Paris strain genomes.Characteristics of sequenced strains are summarised in Table 3. Analysis involved 11 ST1 strains from the three incidents, a concurrent ST40 strain and three 'outgroup' strains including unrelated ST1, ST40 and ST23 strains.The median coverage was 55.9 (range: 34.6-131.5)and on average 98.6% of the MLST + targets could be called.The SBT ST was called complete and correct for 14 of the 15 draft genomes.
The three incidents are described in Table 4.All three cases involved children below one year of age exposed to domestic free-standing cold-water humidifiers.In case 1 the humidifier was filled with tap water whereas in cases 2 and 3 humidifiers were filled with water dispensed through domestic filtrating machines (water bars) that used charcoal filters and ultraviolet light.In case 1 Lp ST1 was detected by culture and polymerase chain reaction (PCR) of the patient's sputum and Lp ST1 was also recovered from humidifier residual water.Notably, environmental sampling revealed a ST40 strain which was not present in clinical samples.In case 2, diagnosis was made using urinary antigen testing and no sputum was available for analysis.Multiple environmental samples obtained from the water system, water filtrating machine, and humidifier were all positive for Lp ST1.In case 3, sputum was culture negative but PCR was positive for Lp.Direct SBT performed on sputum revealed a co-infection with Lp serogroup 1 ST1 and Lp serogroup 3 ST93.Environmental samples obtained from the water system, water filtrating machine and humidifier were all positive for Lp ST1 and some were also ST93 positive.
All 17 analysed genomes (including the 15 from Israel as well as the Philadelphia and Paris strain) shared in total 1,446 of the 1,521 defined core genome genes (data not shown -allelic profiles available upon request).From these allelic profiles SeqSphere + calculated and drew a minimum spanning tree where the number of differing alleles is given along the branches (Figure).For case 1, identical clinical and environmental ST1 strains (Lp-2002694p8 and Lp-56207) were found (no differing alleles) and a concurrent ST40 (Lp-2002694 p7), which as expected, did not cluster with implicated ST1 strains.This ST40 strain exhibited a difference of six alleles (of the 1,521 core genome genes) from an unrelated ST40 strain serving as an 'outgroup' for ST40.Of the four environmental strains representing the chain of transmission in case 2, three were identical ST1 strains (Lp-119, Lp-121 and Lp-122) and one had only one differing allele (Lp-120).Case 3 demonstrated more complex clustering of environmental strains into

Figure
Use of a minimum spanning tree generated from allelic profiles of 1,446 core genome genes shared by 17 Legionella pneumophila strains analysed, to investigate paediatric humidifier-associated Legionnaires' disease cases The Lp strain numbers are described inside the circles.Finished genomes of the 'Philadelphia' (ST36) and 'Paris' (ST1) strains obtained from the National Center for Biotechnology Information (NCBI) were used as reference (turkoise blue).Strains corresponding to the three ST1 humidifier-associated epidemiological clusters are designated in green (epidemiological cluster 1; includes also one ST40 'bystander' strain), yellow (epidemiological cluster 2) and orange (epidemiological cluster 3).Epidemiologically unrelated strains, including ST1 and ST40 'outgroup' strains (Lp-032 and Lp-001, respectively) and a ST23 (Lp-012) are designated in dark blue.The number of differing alleles is stated along the branches of the tree.Lines connecting strains within cluster type distance are highlighted by pale red background shading.The lengths of the branches reflecting distances between strains are drawn in a logarithmic scale.
two pairs (one identical pair formed by Lp-282-1 and Lp-284, and one pair with three differing alleles formed by Lp-285 and Lp-283) and a fifth distinct strain (Lp-286-1) which on epidemiological grounds was considered the most likely cause of infection (Lp strain recovered from humidifier residual water and therefore most likely to have been aerosolised during humidifier use).A subsequent cloning experiment performed on the patient's sputum extract, revealed sequences unique to at least two of the environmental strains (Lp-286-1 and Lp-285), suggesting co-infection (data not shown).

Discussion
WGS-analysis is emerging as the optimal molecular epidemiology tool for microbial genotyping but its application and implementation are limited by challenges in timely analysis of data and standardised integration into scalable classification schemes.While most WGS-based epidemiological investigations published to date have relied on mapping of SNPs, extension of the classical MLST approach [27] to a gene-by-gene typing scheme based on the entire core genome [16] is a promising approach for a standardised, portable and expandable typing method.The current study presents a novel core-genome allele-based typing scheme for Lp based on a standardised analysis of WGS of an internationally representative and biologically diverse Lp collection of genomes.The proposed scheme follows several recently proposed cgMLST schemes for pathogens of public health importance including Staphylococcus aureus, Listeria monocytogenes, Escherichia coli, Neisseria meningitidis and Mycobacterium tuberculosis [13][14][15]28,29].
Only a few clusters of LD have been investigated to date using a WGS-based approach.One study provided a retrospective analysis of a community-acquired LD cluster in which WGS yielded comparable results to that of conventional SBT [26].Notably, WGS could not identify the most likely source of infection.The two clinical and three environmental isolates analysed in that study were not more than 15 SNPs apart [26].Another study provided a real time investigation of a nosocomial LD cluster involving two patients [30] in which WGS had a greater resolution as compared with conventional typing and was able to link the two cases with an environmental strain and possibly to a past case.Related strains in that study were 17 SNPs apart.The reliance on SNP mapping makes those two reports difficult to reproduce and to compare, especially given the differences in reference genomes and bioinformatics pipelines used, as well as software parameter selections.Nevertheless, both papers contribute to the proof of concept of harnessing WGS for Lp investigation.
In our report, WGS of Lp strains related to three independent LD incidents was successful in demonstrating the phylogeny of implicated clinical and environmental strains.We chose to focus on Lp ST1 which is one of the most abundant ST globally and by far the most common cause of LD in Israel [31] and thus conventional tools such as SBT are not always powerful enough for epidemiological purposes.Moreover, it has recently been shown that Lp ST1 could be further characterised using additional typing methods such as spoligotyping [32].Using the 'Paris' ST1 type strain and an 'outgroup' ST1 strain our analysis shows that the cgMLST scheme of ca 1,500 genes has an adequate discriminatory power and could resolve clustering of multiple strains in the ST1 complex.Discriminatory power in concordance with epidemiological data are among the most important performance criteria set to evaluate proposed typing methods [33].In that respect, the observed clustering pattern of our study isolates suggests that a difference of up to four alleles between strains may serve as a preliminary threshold value for defining a WGS cluster.Nevertheless, this should be further evaluated and fine-tuned as additional genomic epidemiology data on Lp accumulates.
We included in our analysis LD cases diagnosed by three accepted laboratory modalities, being sputum culture, urinary antigen and sputum PCR in order to demonstrate the usefulness of WGS-based typing in all typical epidemiological scenarios.Of note is that case 3 was more difficult to resolve as strain clustering yielded three unrelated ST1 groups.While spontaneous mutations could provide a possible explanation, we believe that this is the result of infection with multiple ST1 strains.This reflects the inherent limitation of culture-based methods used in water testing for Lp, where picking out a single colony from similar morphotypes may overlook the presence of multiple strains.This limitation could be resolved by liberal use of SBT target screening before colony picking and in the future via metagenomic approaches.
One notable hindrance to routine application of WGS for Lp genotyping is the failure to determine the mompS allele number (and as a result determine the ST) for some strains, regardless of whether SNP mapping or cgMLST is being used.This phenomenon results from the presence of multiple copies of the mompS gene in many Lp strains, which are commonly non-identical, a fact not known when the SBT scheme was initially designed [7].Current SBT primers used for Sanger sequencing amplify only a single copy of the gene due to sequence variation in the noncoding flanking region and thus generate consistent ST designations [7].Therefore, in the future, tools for extraction of the correct mompS allele from finished genomes harbouring multiple gene copies must take synteny information, e.g. the primer sequences, into consideration to choose the correct gene copy for allele calling.Remediation of the problem for draft genomes is more difficult to achieve as the rather short second generation sequencing reads from both copies are assembled or mapped into a single contig.Notably, resolution of this limitation would be highly desirable for routine WGS application for Lp as backwards compatibility would be maintained.
Our report also highlights humidifier-associated paediatric LD as a continuously emerging risk for LD.After the first paediatric case was acknowledged and reported in Eurosurveillance [10], additional cases have been reported in Europe including Spain [34] and Cyprus [35], the latter involving a nosocomial outbreak in a nursery.The public health response to the first case in Israel was coordinated by the National Programme for Legionellosis Prevention.As part of this response, the Israeli Ministry of Health released specific guidance to professionals and members of the public.Thereafter, scheduled press releases occur every winter.As mandated, cold water humidifiers sold in the country are also labelled with a safety hazard warning through the National Standards Institute of Israel.Moreover, the Israeli Paediatrics Association has released a position paper regarding domestic humidifier use highlighting the benefits and risks.Nevertheless, paediatric humidifier-associated LD is still a public health challenge and deserves more attention.
In conclusion, we devised a WGS-based cgMLST scheme for typing of Lp, which provided high-resolution analysis of Lp strains within the same clonal complex.cgMLST appears to have satisfactory discriminatory power for LD cluster analysis and is advantageous over mapping followed by SNP calling as it is easier for standardisation and dissemination.cgMLST thus has the potential for becoming a gold standard tool for LD investigation.Humidifiers pose an ongoing risk as vehicles for LD and should be considered in cluster investigation and control efforts.

Table 1
Finished genomes a and assembled raw reads b used for Legionella pneumophila core genome definition (n=17) cReference genome.d Extraction of mompS allele from genomic data not possible due

Table 2
Finished genomes a and assembled raw reads b used for Legionella pneumophila core genome validation (n=21) BAPS: Bayesian analysis of population structure; EBI: European Bioinformatics Institute; N/A: not applicable; NCBI: National Center for Biotechnology Information; SBT: sequence-based typing; ST: sequence type.a

Table 3
Whole genome sequencing data of Legionella pneumophila strains included in the study a ENA: European Nucleotide Archive; MLST: multilocus sequence typing; SBT: sequence-based typing; ST: sequence type.Extraction of mompS allele from whole genome sequence data not possible due to multi-copy occurrence.
a ENA study number PRJEB7140.b

Table 4
Characteristics of paediatric humidifier-associated Legionnaires' disease cases included in the study PCR: polymerase chain reaction; ST: sequence type.