Genetic structure of SARS-CoV-2 reflects clonal superspreading and multiple independent introduction events, North-Rhine Westphalia, Germany, February and March 2020

We whole-genome sequenced 55 SARS-CoV-2 isolates from Germany to investigate SARS-CoV-2 outbreaks in 2020 in the Heinsberg district and Düsseldorf. While the genetic structure of the Heinsberg outbreak indicates a clonal origin, reflecting superspreading dynamics from mid-February during the carnival season, distinct viral strains were circulating in Düsseldorf in March, reflecting the city’s international links. Limited detection of Heinsberg strains in the Düsseldorf area despite geographical proximity may reflect efficient containment and contact-tracing efforts.


Supplementary Text
RNA extraction and Nanopore sequencing Respiratory samples from nasopharyngeal swabs were used for total nucleic acid extraction using the EZ1 Virus Mini Kit v2.0 on an EZ1 Advanced XL (Qiagen, Germany) according to manufacturers' instructions. SARS-CoV-2 was detected as described 1 with a plasmid-standard for quantification. Viral RNA was reverse-transcribed to single-strand cDNA using random hexamers and SuperScript 2 . Viral cDNA was PCR-amplified using the Artic network SARS-CoV-2 protocol with V1 primers 3,4 . Prior to library preparation, for each sample, PCR pools 1 and 2 were combined (500ng DNA per pool). Sequencing was carried out on the Oxford Nanopore Flongle and MinION devices (Supplementary  Table S1), utilizing FLO-FLG001 and FLO-MIN106 flow cells and the SQL-LSK109 ligation sequencing kit. On the MinION device, barcoding was carried out with the EXP-NBD104 and EXP-NBD114 barcoding kits (Supplementary Table S1). Flongle runs were carried out without multiplexing.
The pipeline comprises the following steps:  Length filtering of the generated reads to identify and remove reads that emanate from chimeric PCR fragments.
artic gather --min-length 300 --max-length 700 --directory out_dir  Demultiplexing with porechop v0.3.2 6 . The software automatically determines the utilized adapter sets und bins the reads according to their barcode with a lower limit of at least 80% identity.
Across the cohort, we identified the following recurrent error modes: i) Medaka over-called polymorphisms in low-coverage regions; these are automatically masked out by the ARTIC pipeline; ii) in rare instances, adapter sequences were not removed completely by the ARTIC pipeline preprocessing steps, leading to clustered false-positive allele calls in Medaka, but not in Nanopolish; iii) in rare instances, Nanopolish detected likely false-positive polymorphisms not called by Medaka and exhibiting strong strand bias; iv) potentially multi-allelic positions were, though sometimes identified by Medaka as 'heterozygous', often missed by both callers.
For each sample, a final consensus FASTA file was produced, in which i) regions with false-positive calls were masked out with 'N' characters; ii) ambiguous or potentially multi-allelic positions were represented with the corresponding IUPAC ambiguity codes.

Direct cDNA viral sequencing
For a proof-of-concept experiment, we selected a sample (NRW-39) with high viral load and carried out reverse transcription (RT) as described above, using a) random hexamer primers and b) a subset of the Artic primers. The resulting single-stranded cDNA pools were converted into double-stranded DNA with the NEBNext® Ultra™ II Non-Directional RNA Second Strand Synthesis Module (NEB) and sequenced on separate Flongle flow cells as described above. The generated sequencing reads were basecalled as described above and aligned with minimap2 9 (version 2.17-r974-dirty, flags -a -x map-ont) against a combined FASTA file containing the GRCh38 version of the human reference genome 10 and the SARS-CoV-2 reference genome (MN908947.3) 11 . All generated sequencing data were submitted to SRA (Sequencing Read Archive; Supplementary Table S1).
Manual inspection with IGV 8 showed that the alleles of the polymorphic positions called based on the standard Artic protocol (bottom panel in the following figure) were recovered from the direct cDNA sequencing data based on RT primed with Artic primers (top panel in the following figure); the number of positions with low-frequency non-consensus alleles, however, was increased when compared to the standard Artic run (we note, however, that this might also be due to slightly increased error rates on the Flongle platform compared to the MinION).
By contrast, the data generated based on random RT priming exhibited significantly increased noise levels (middle panel in the following figure).
We conclude that further work, including on samples with lower viral load, is required to assess the potential of direct cDNA sequencing from SARS-CoV-2 patient material.
IGV-based comparison of direct cDNA and PCR-based sequencing data and variant calls on sample NRW-39. Panels, from top to bottom: Medaka VCF, based on Artic PCR; coverage and reads, RT primed with Artic primers; coverage and reads, RT primed with random hexamers; coverage and reads, Artic PCR. Note that the sequencing data in the upper two panels were generated on the Oxford Nanopore Flongle platform, which typically exhibits slightly higher error rates than the MinION platform, which was used for the generation of the sequencing data visualized in the bottom panel.

Illumina sequencing and bioinformatics analysis
To verify the accuracy of the Nanopore-based viral assemblies, viral genome sequencing was repeated for the first 12 isolates of our cohort (Supplementary Table S1). DNA amplicon samples, generated with the Artic protocol as described above, were quantified by fluorometric assay (Qubit DNA HS Assay, Thermo Fisher Scientific) and quality was assessed by capillary electrophoresis using the Fragment Analyzer and the "High Sensitivity NGS Fragment Kit (1-6000bp) Assay" (Agilent Technologies, Inc. Santa Clara, USA). All samples showed high quality amplicons with 400bp fragment length. Library preparation was performed using the "TruSeq Nano DNA Library Prep Kit -Part # 15041110 Rev. D protocol" (Illumina Inc. San Diego, USA) by skipping the fragmentation step and starting with the repair ends steps. Briefly, 200 ng of each sample was used for end repair, overhang conversion, adenylation, adapter ligation and library amplification. Bead purified libraries were normalized and finally sequenced on the MiSeq system (Illumina Inc. San Diego, USA) with a read setup of 2x251bp. The bcl2fastq tool was used to convert the bcl files to fastq files as well for adapter trimming and demultiplexing. Bioinformatic analysis was based on iVar 12 (v 1.0.1). The generated assemblies (consensus FASTA sequences) were compared to the Nanopore-based assemblies and positions with discrepant alleles were assesses with IGV 8 (Supplementary Table S3).

Genomic neighbourhood analysis and minimum spanning tree computation
A genomic neighbourhood analysis, as defined below, was carried out for all isolate sequences present in GISAID 13 as of 18 April 2020, including the Düsseldorf/Heinsberg sequences generated as part of this study.
1. All sequences in GISAID were aligned against the SARS-CoV-2 reference genome (GenBank: MN908947.3). Isolate sequences with ambiguous characters, excluding 'N', and sequences longer than 29,903bp, were excluded from further analysis. Pairwise alignments were computed with edlib 14 in ends-free alignment mode. If the first or last 10 alignment columns contained any gaps, the isolate was excluded from further analysis. If necessary, the pairwise alignment was padded with 'N' characters to cover the complete reference sequence. 2. For each remaining isolate sequence, the pairwise alignment was converted into a list of polymorphic positions relative to the reference genome by determining, for each nucleotide of the reference sequence, the allele present in the isolate sequence (in this representations, consecutive single-nucleotide polymorphisms or deletions relative to the reference are represented as separate reference-divergent alleles, whereas consecutive insertions are represented as one reference-divergent allele). 'N' alleles were discarded. 3. A pairwise distance matrix was computed for all remaining isolate sequences, by determining, for each pair of isolate sequences, the number of non-shared polymorphic alleles 4. For each Düsseldorf/Heinsberg isolate sequence part of the distance matrix, up to 10 "neighbouring" isolate sequences were identified from the set of other isolates sequences, employing the following steps: a) the identifiers of all samples present in the computed distance matrix were stored as a vector , excluding any Düsseldorf/Heinsberg samples; b) the order of was randomized; c) for each Düsseldorf/Heinsberg sequence identifier , a vector ′ was generated by stably sorting the elements of according to their distance to in increasing order; d) ′ was filtered to exclude any elements with distance to of more than 1; e) the first 10 elements of ′ were chosen as the neighbours of (if there were fewer than 10 elements in ′, all elements of ′ were chosen as the neighbours of ). The output of step d) is shown in Supplementary Table S4, i.e. the table shows, for all Düsseldorf/Heinsberg sequences, the IDs and country-level statistics of all possible neighbour sequences. 5. A minimum spanning tree analysis was carried out for all Düsseldorf/Heinsberg sequences part of the distance matrix, and their selected genomic neighbours, using the mst function of the R 15 package pegas 16 and visualized with functions from sna 17 .