Epidemiological investigation of Pseudomonas aeruginosa isolates from a six-year-long hospital outbreak using high-throughput whole genome sequencing

L A Snyder1,2, N J Loman1, L A Faraj3, K Levi4, G Weinstock5, T C Boswell4, M J Pallen (m.pallen@warwick.ac.uk)6, D A Ala’Aldeen3 1. Institute of Microbiology and Infection, School of Biosciences, University of Birmingham, Birmingham, United Kingdom 2. Kingston University, Kingston upon Thames, Surrey, United Kingdom (current affiliation) 3. Molecular Bacteriology and Immunology Group, School of Life Sciences, University of Nottingham, Nottingham, United Kingdom 4. Nottingham University Hospitals NHS Trust, Nottingham, United Kingdom 5. Department of Genetics, Washington University School of Medicine, St. Louis, Missouri, United States 6. Division of Microbiology and Infection, Warwick Medical School, University of Warwick, Coventry, United Kingdom


Introduction
Pseudomonas aeruginosa is a Gram-negative bacterium that is widespread in the environment and engages in a wide variety of interactions with eukaryotic host organisms.It is a common opportunistic pathogen in humans, causing a broad range of infections in community and healthcare settings.The most serious manifestations of infection include bacteraemia (particularly in neutropenic patients), pneumonia (particularly in cystic fibrosis patients and critically ill patients), urinary tract infections and wound infections (especially in patients with burn injuries) [1,2].This organism is intrinsically resistant to many antibiotics and in recent years resistance has emerged to what were previously antimicrobial agents of choice [3,4].
As P. aeruginosa is ubiquitous in the environment and also forms part of the resident microbiota of many patients, molecular typing methods are required to identify the sources and routes of transmission within the healthcare setting.Current molecular epidemiological typing systems, such as pulsed-field gel electrophoresis (PFGE) and variable-number tandem repeat (VNTR) analyses, have provided evidence of clonal outbreaks of P. aeruginosa affecting clinics or hospitals across the United Kingdom (UK) [5,6].For example, such approaches have identified several clones (the Liverpool, Midlands 1 and Manchester strains) affecting multiple cystic fibrosis patients in the UK [5][6][7].However, these typing methods are of limited resolution when compared to whole genome sequencing.In addition, traditional typing methods fail to provide any insights into the biology of outbreak strains, or show how they might be evolving over time.
A prolonged P. aeruginosa outbreak at Nottingham University Hospitals National Health Service (NHS) Trust -City Campus (UH-NHST-CC) that lasted from 2001 to 2007 illustrates some of these issues [8].Although the outbreak has been brought under control, one to two cases are still seen each year, so it has not been completely eliminated.Here, we have seen a pattern of apparently endemic infection, associated with a single clonal lineage, which was isolated from over 40 patients and also from the hospital environment, as evidenced in previous studies of the isolates [8] and ongoing active surveillance and molecular typing.
The outbreak was first detected when infections caused by five multi-resistant isolates were noted on an 18-bed bone marrow transplant unit at UH-NHST-CC in April 2001.These isolates were resistant to ceftazidime (CTZ), ciprofloxacin (CIP), gentamicin (GEN), piperacillin-tazobactam (TZP), and the carbapenems, but susceptible to amikacin (AMK) and colistin (COL).Surveillance was enhanced, laboratory records were reviewed to identify other cases, and environmental sampling was undertaken.Over the next two years, further isolates from the critical care unit and a new bone marrow transplant unit at the same hospital were identified, initially on the basis of the same antibiogram.As a result of the infection control programmes initiated in response to these outbreaks, infections were identified in 32 haematology patients [8].Some patients acquired the multi-resistant P. aeruginosa without temporal overlap with other positive patients, suggesting an environmental source.Infection control investigations at the time suggested that the outbreak was associated with colonisation of hand washing basins.Initial molecular epidemiological typing was performed by PFGE using restriction enzyme SpeI [9,10] and identified a single outbreak strain among clinical and environmental isolates [8].Similarity between all isolates was >85% and these formed a single group in cluster analysis, defined as the Nottingham/Trent cluster.In addition, the recommendations of Tenover et al. [10] were followed, which stipulate that isolates belonging to the same strain should differ by a maximum of four fragments.Isolates were additionally typed by random amplification of polymorphic DNA (RAPD) [11] and all demonstrated the same pattern [8].It was found that in critical care, tracheostomy was the strongest risk factor for acquisition and that the tracheostomy inner tubes were being cleaned in hand wash basins that were found to be contaminated.The correct fitting of waste trap heaters, the removal of one persistently colonised sink, enhanced surveillance, and a review of equipment and sink cleaning, particularly in the case of tracheostomy equipment helped to bring the outbreak under control.In clinical haematology, oral colistin was introduced for selective decontamination of the gut during neutropenia and use of ciprofloxacin, to which the strain is resistant, was reduced for neutropenic prophylaxis.
Whole genome sequencing, facilitated by the advent of high-throughput approaches, brings the promise of single-base-pair resolution between isolates, making it the ultimate molecular typing method for bacteria.Several recent studies have shown that analysis of single nucleotide polymorphisms (SNPs) in bacterial genomes provides a means of determining relatedness between epidemiologically linked isolates and tracking bacterial evolution over periods of months to years [12][13][14][15][16].In particular, several studies have tracked the genetic changes that emerge during long-term colonisation of cystic fibrosis patients [17,18].In addition, genome sequencing provides a strain-specific list of protein-coding genes, which, through comparisons with the genes sets of other strains from the same species can provide insights into the unique biology of an epidemic strain.We therefore applied high-throughput sequencing to five isolates from the UH-NHST-CC hospital outbreak to determine whether genome sequencing can inform our understanding of the epidemiology and biology of a hospital pathogen in this setting.

Methods Microbiology
Isolates of P. aeruginosa were collected from the microbiological specimens of patients cared for at the Queen's University Medical Centre over a six-year period from January, 2002 to December, 2007.Isolates were frozen on beads and stored at -80°C.P. aeruginosa isolates were sent to the Health Protection England's (HPE) Laboratory of HealthCare Associated Infection in London for analysis by PFGE.Four isolates from different time points in the outbreak were selected for genome sequencing in this study on the basis of PFGE and RAPD results, as well as an additional isolate from June 2001 (Table 1).This additional isolate was identified as part of the look-back exercise initiated after the first outbreak; its PFGE profile was established by the HPE.The antimicrobial susceptibilities of isolates were tested using the disk diffusion method against AMK, aztreonam (AZT), CTZ, COL, CIP, GEN, meropenem (MER), tobramycin (TOB) and TZP and according to British Society for Antimicrobial Chemotherapy (BSAC) guidelines [19].

Isolation of genomic DNA and high-throughput sequencing
Genomic DNA was obtained from colony-purified P. aeruginosa, using the DNeasy DNA extraction kit according to the manufacturer's instructions (Qiagen).DNA from isolate PANOT106 was sequenced using whole genome shotgun and 2.5-kilobase paired-end 454 protocols in accordance with the manufacturer's instructions (Roche) at the University of Liverpool's Centre for Genomic Research.Isolates PANOT106, PANOTK11, PANOT101, PANOT340, and PANOT738 were sequenced one sample per lane using the Illumina GA2, running 35 cycles with paired-end libraries (~250 base-pair insert size).Sequencing on the Illumina GA2 platform was carried out at the sequencing centre of the University of Washington at St. Louis.

Genome assembly
Roche 454 reads for isolate PANOT106 were assembled de novo with Newbler v2.3 (Roche) to create a reference assembly, using the default settings with -g enabled to produce contig graphs for use later in the variant detection pipeline.Nesoni 0.51 (Victorian Bioinformatics Consortium) was used on the Illumina reads for PANOT106 to correct errors in the reference assembly, particularly short indels in regions with low depth of coverage or in homopolymeric tracts.Nesoni parameters were: depth 10; purity 0.8; ambiguitycodes 1; fidelity 1.0; monogamous 1; trim 0; max-pairsep 250; only-pairs 1; suffix1 /1; suffix2 /2; same-dir 0; circular 0; savehits 0. Bowtie v0.12.0 [20] was used to align the PANOT106 Illumina reads to complete P. aeruginosa genomes available in GenBank, using parameters --solexa -quals -S -m 1 -p 4. The progressive Mauve component of Mauve (version 2.3.1) was used to map the PANOT106 reference assembly to the genome sequence of P. aeruginosa PAO1.

Creation of a well-validated single nucleotide polymorphism (SNP) set
A series of filters were applied to the sequence data to create a set of high-confidence SNP variants.Bowtie v0.12.0 was used to align the Illumina data for each of the five sequenced strains to the PANOT106 reference assembly and variants extracted using samtools with parameters -cv -N n to create set <VARIANTS>.The value of n was set to the number of isolates in the sample.

Analysis of gene content
Protein-coding sequences (CDSs) in the PANOT106 assembly were identified using Gene Locator and Interpolated Markov ModelER (GLIMMER) [23].Protein Basic Local Alignment Search Tool (BLASTP) [21] was used to compare the set of CDSs in PANOT106 with the predicted protein repertoires of other completed P. aeruginosa genome sequences.Regions of difference (RODs) between the PANOT106 and the known P. aeruginosa pan-genome (as defined by complete genomes in GenBank) were identified on the basis of the following criteria: the ROD must contain ≥ 10 strainspecific CDSs, which have no orthologues in other genomes (as defined by mutual best-hit analysis), with the ROD interrupted by no more than five genes already present in the pan-genome.Orthologues were computed using OrthoMCL v2.0 with default parameters [27].Singletons (coding sequences which do not have orthologues in other outbreak isolates) were verified by translated nucleotide databases BLAST (TBLASTN) searches against the original sequence.Additional information with regards to the annotated features made use of the Pseudomonas Genome Database [28].

Assembly of the genome sequence data
Using the Roche 454 next-generation genome sequencing platform, 6,720,954 bases were generated from the DNA extracted from isolate PANOT106.Reads were assembled, using Newbler, into 1,059 contigs, which were further assembled into 19 scaffolds using paired-end sequence data to determine the relative positions and distances of contigs.The N50 of the scaffolds is 825,048 and the average paired end distance is 2,365 bases, with a standard deviation of 591 bases.
The next-generation sequencing data generated on the Illumina platform for isolates PANOT106, PANOTK11, PANOT101, PANOT340, and PANOT738, is summarised in Table 2. Coverage with the 35 base Illumina reads ranged from 62X to 90X.The data for isolate PANOT106 was used to correct the assembly of this genome sequence based on the 454 data.This resulted in 283 total edits to the original assembled sequence.

Antibiogram differences
The only detected difference in antibiotic resistance amongst the five isolates was AMK resistance in PANOTK11 (data not shown).Resistance to aminoglycosides is typically associated with aminoglycosidemodifying enzymes.Each of the outbreak strains carry identical aminoglycoside 3'-phosphotransferase type IIb coding sequences homologous to the Pseudomonas strain PAO1 CDS with accession PA4119.Additionally there were no SNPs detected upstream to suggest a change in promoter in PANOTK11.The alternative resistance-conferring aminoglycoside 4'-O-adenylyltransferase IIb [29] was not found in any of the outbreak strains.We therefore conclude that AMK resistance in PANOTK11 is probably due to impermeability or efflux, although even if true, this fails to explain why resistance to other aminoglycosides was unaffected [30].

Novel sequences within the genomes of the outbreak isolates
There are six large regions of difference between PANOT106 and the P. aeruginosa pan-genome (Figure panel A-F), five of which appear to be prophages or prophage remnants.In particular, PANOT106 contains a region that resembles the serotype-converting P. aeruginosa bacteriophage D3 and a region that encodes homologues of phage structural proteins from phi CTX, a cytotoxin-converting phage of P. aeruginosa.The only non-prophage region of difference contains copper-and arsenic-resistance genes and is homologous to a similar region in the genome of Stenotrophomonas sp.SKA14.
PANOTK11 has a 48 kb sequence that is not present in the other strains sequenced here (Figure panel G).This region was assembled into a single contig and was annotated to contain 24 CDSs.Although this region has no homologue in any of the other strains sequenced here, it is homologous to sequence data from isolate PACS171b [31], which was isolated from the oropharynx of a nine month-old cystic fibrosis patient [32].Of the 48 kb, all but 13 kb share similarity with isolate PACS171b; these 35 kb mostly contain hypothetical genes and a putative restriction-modification system and putative sigma-54 transcriptional regulator.Within the 13 kb region that is different from isolate PACS171b, isolate PANOTK11 contains a member of the DEAD box helicase family, an OLD family endonuclease, and a hypothetical gene.
No novel sequences were found within the genomes of PANOT101, PANOT340, and PANOT738, in comparison with the P. aeruginosa pan-genome.

Flagellum operon
All five of the outbreak strains are missing a portion of the flagellum operon compared to that present in P. aeruginosa strains PAO1 and LESB58.The corresponding CDS PA1088, encoding flagellar glycosyl transferase FgtA is not present in the outbreak isolates sequenced here.The fliC (PA1092) gene and CDSs PA1092-PA1096 are present, although these are divergent in their nucleotide sequences.

Genome-wide single nucleotide polymorphism (SNP) analysis of outbreak strains
The PANOT106 sequence, assembled from Illumina and 454 sequence data, aligns well with the P. aeruginosa strains PAO1 (95.3% sequence identity),  3).Twentyfour of the SNPs result in non-synonymous changes to the associated encoded amino acid in the PAO1annotated coding region (Table 4).This high frequency of non-synonymous changes (24 of 37) suggests that some of the SNPs represent adaptive changes.
Nineteen of the SNPs separate PANOTK11 from the other four UH-NHST-CC Nottingham isolates (Table 3).None of these nineteen SNPs are found in any of the other four outbreak isolates, which all share greater similarity to the reference strain PAO1 at these SNP loci.These SNPs in this isolate, the 48 kb region present only in this isolate, and the AMK resistance of this isolate indicate that PANOTK11, despite being identified as related to the other outbreak isolates as part of the Nottingham/Trent cluster via PFGE and RAPD, is not part of the same outbreak and cannot represent the index case.Despite the earlier typing results, PANOTK11 is a phylogenetic outlier when compared to other outbreak isolates on the genomic level.Interestingly, one of these PANOTK11-specific SNPs (PNSNP12) maps to lasR (Table 4), encoding a transcriptional regulator that responds to a homoserine lactone signal to activate expression of P. aeruginosa virulence factors through quorum sensing.Mutations in LasR have been reported in several contexts, including adaptation to the airways of cystic fibrosis patients [33].PNSNP12 does not appear to be within the helixturn-helix DNA binding domain of the encoded protein according to the available sequence and functional information in UniProt [35], and to predictions by the helixturnhelix (HTH) software tool [34] but SNP mutations in other LasR regions, including the ligand binding domain, have been reported previously [36].Another of the PANOTK11 SNPs (PNSNP34) maps to algB (Table 4), encoding an NtrC-like response regulator, which controls alginate biosynthesis in mucoid P. aeruginosa [37].In this case the SNP falls within the helix-turnhelix motif predicted by HTH [34] and described previously [38].This SNP may therefore impact upon the virulence of PANOTK11 and differences in alginate biosynthesis should be explored.
PANOT101 and PANOT106 were isolated a year apart (2002 and 2003, respectively) and share many of the same non-synonymous SNPs, apart from five identified here (Table 3).PNSNP10, within a sensor/response regulator hybrid (PA0928), and PNSNP25, within nrdG, in PANOT106 are also shared with isolates PANOT340 and PANOT738 (Tables 3 and 4).PNSNP11 and PNSNP13 in a hypothetic protein (PA0938) and two-component sensor (PA1458), respectively, are only found in PANOT106 in this study.Likewise, PNSNP26 in PANOT101 (Table 3) is only found within this isolate in this study.PNSNP26 is within speC encoding ornithine decarboxylase (PA4519; Table 4) part of the biosynthetic pathway to produce putrescine, required for growth, but loss of function of this protein can be compensated through an alternative putrescine biosynthesis pathway using speA [39].The two subsequent strains, PANOT340 and PANOT738, isolated in 2005 and 2007, respectively, show a progressive increase in the number of SNPs relative to PANOT106.These three isolates share two non-synonymous SNPs (Table 3), one, PNSNP10, within a sensor/response regulator hybrid (PA0928, gacS) and one, PNSNP25, in a class III (anaerobic) ribonucleoside-triphosphate reductase activating protein gene (PA1919, nrdG).Five shared non-synonymous SNPs (PNSNP2, PNSNP5, PNSNP19, PNSNP23, and PNSNP32) are present in isolates PANOT340 and PANOT738, but not in the other outbreak isolates.In isolate PANOT738, an additional eight SNPs are present, three leading to non-synonymous changes (Table 3) and five in intergenic regions.This progressive accumulation of SNPs is consistent with an ancestor-descendant relationship.The acquisition of five SNPs in two and a third years (5 SNPs in 28 months, or 5.6 months per SNP) and a further eight SNPs in two and two third years (8 SNPs in 32 months, or 4 months per SNP), with 13 over a span of five years from January 2003 to December 2007 (13 SNPs in 60 months, 4.6 months per SNP) suggests that a new SNP is acquired in this lineage on average every four to five months.
The five non-synonymous SNPs in common between isolates PANOT340 and PANOT738 include PNSNP32 which is associated with a type II citrate synthase (PA1580, gltA), PNSNP2 in a hypothetical protein that may be a thioesterase (PA3130), PNSNP5 in a putative methylated-DNA--protein-cysteine methyltransferase (PA3596), PNSNP23 in a probable two-component response regulator (PA5364), and PNSNP19 in an ATPdependent protease peptidase subunit of heat shock protein HslV (PA5053, hslV).The small form of the citrate synthase protein encoded by gltA is involved in the Krebs cycle and is expressed in excess during stationary phase growth [40].Whether PNSNP32 has some bearing on the stationary phase growth rate of isolates PANOT340 and PANOT738 in comparison with isolate PANOT106 requires further investigation, as this may impact upon survival of the bacteria in the hospital setting.PNSNP5 is within a CDS potentially encoding a methylated-DNA--protein-cysteine methyltransferase, which is involved in the repair of alkylated DNA (EC 2.1.1.63).Sensitivity testing for alkylating agents of the outbreak isolates may be warranted in light of the PNSNP5 data.
The remaining eight SNPs present only in isolate PANOT738 are associated with non-synonymous changes in three different CDSs and five intergenic regions.The three SNPs in CDSs are PNSNP33 found within a putative class III pyridoxal phosphate-dependent aminotransferase (PA0530), PNSNP27 in tadZ (PA4303), and PNSNP16 in a putative transcriptional regulator (PA2846).The CDS for tadZ is within the widespread colonisation island (WCI) encoding tight adherence pili.The genes of the WCI are essential for biofilm formation and colonisation [41], which are important virulence factors in P. aeruginosa.The PNSNP27 within tadZ may therefore influence the success of isolate PANOT738 in forming biofilms and colonising the host, medical devices, or even medical surfaces such as the persistently colonised sink.Isolate PANOT738 also contains PNSNP16 that is within PA2846, which is one of several predicted transcriptional (PA1430) ID: identity; SNP: single nucleotide polymorphism.
and response regulators (PA0928, PA1458, PA2656, PA5364, PA5483) that contain SNPs in these outbreak isolates, suggesting that differences in regulation may play a role in the phenotypes observed.

Discussion
Here, we have shown that genome sequence data obtained with high-throughput sequencing technologies can provide novel insights into the epidemiology and biology of hospital isolates of P. aeruginosa.One isolate, PANOTK11, was determined to be an outlier and not part of the same outbreak lineage as the other four isolates.This is based on genomic level sequence differences; previous typing with PFGE and RAPD placed all five of these isolates into a single lineage.
Amongst the related isolates, we have demonstrated that P. aeruginosa lineages from a hospital setting contain a sufficient number of SNPs to allow the construction of an epidemiological narrative -an important benchmarking exercise that lays the foundations for future investigations into the fine-grained genomic epidemiology of this troublesome pathogen.a SNP ID from Table 3. b SNP from one of the isolates as listed in Table 3. Reverse complementary bases, in parenthesis, have been entered where the PAO1 strand is negative relative to the data in Table 3. c Base at this SNP location within the PAO1 genome sequence.d Position of the SNP within the PAO1 genome sequence.e CDS locus identifier from the PAO1 genome sequence.
In addition, genome sequencing has highlighted the existence of strain-specific gene clusters that might underpin the tenacious survival of these strains within our hospital over many years.Several SNPs which may confer phenotypic differences between the isolates have been identified and bear further investigation, particularly as some may play a role in survival of isolates in the hospital.These include SNPs within PA1580 that could influence stationary phase growth rate and impact on the survival of the bacteria in the hospital setting, within PA3596 that could influence the sensitivity to alkylating agents, and within PA4303 that may be important for biofilm formation.
This study illustrates the potential of the first generation of high-throughput sequencing platforms in the investigation of hospital outbreaks.Relentless improvements in cost and performance (particularly those delivered recently by bench top sequencing [42]) as well as data analysis software, are taking these technologies ever closer to the clinical microbiology laboratories.With a price tag now comparable to the cost of traditional molecular typing, bacterial genome sequencing is now poised to impact on clinical practice.

Figure
FigureRegions of sequence difference between NottinghamPseudomonas aeruginosa isolates and the P. aeruginosa pan-genome difference between isolate PANOT106 and the P. aeruginosa pan-genome Region of sequence difference between isolate PANOTK11 and the P. aeruginosa pan-genome

Table 1
Selection for sequencing of Pseudomonas aeruginosa isolates originating from an outbreak in Nottingham that occurred between 2001 and 2007, United Kingdom 10 -N 2 -l 30, equating to the following filtering options: Minimum root mean square mapping quality for SNPs, 25; Minimum root mean square mapping quality for gaps, 10; Minimum read depth, 10; Maximum read depth, 90; Minimum indel score for nearby SNP filtering, 25; SNP within INT bp around a gap to be filtered, 10; Window size for filtering dense SNPs, 10; Max number of SNPs in a window, 2; Window size for filtering adjacent gaps, 30.

Table 2
Illumina sequencing results for Pseudomonas aeruginosa isolates originating from an outbreak in Nottingham that occurred between 2001 and 2007, United Kingdom bly generated a large number of variant calls.Several hundred variants were discarded during the filtering of variants to identify high-quality informative SNPs that vary between the outbreak isolates and which are associated with 37 polymorphic loci (Table

Table 3
Comparison of well-validated single nucleotide polymorphisms in the Pseudomonas aeruginosa Nottingham outbreak isolates

Table 4
Non-synonymous single nucleotide polymorphisms in the genome sequence data of Pseudomonas aeruginosa isolates from the outbreak in Nottingham CDS: protein coding sequence; ID: identity; SNP: single nucleotide polymorphism; Stop: termination codon.