Meta-analysis of the clinical performance of commercial SARS-CoV-2 nucleic acid and antibody tests up to 22 August 2020

Background Reliable testing for SARS-CoV-2 is key for the management of the COVID-19 pandemic. Aim We estimate diagnostic accuracy for nucleic acid and antibody tests 5 months into the COVID-19 pandemic, and compare with manufacturer-reported accuracy. Methods We reviewed the clinical performance of SARS-CoV-2 nucleic acid and antibody tests based on 93,757 test results from 151 published studies and 20,205 new test results from 12 countries in the European Union and European Economic Area (EU/EEA). Results Pooling the results and considering only results with 95% confidence interval width ≤ 5%, we found four nucleic acid tests, including one point-of-care test, and three antibody tests with a clinical sensitivity ≥ 95% for at least one target population (hospitalised, mild or asymptomatic, or unknown). Nine nucleic acid tests and 25 antibody tests, 12 of them point-of-care tests, had a clinical specificity of ≥ 98%. Three antibody tests achieved both thresholds. Evidence for nucleic acid point-of-care tests remains scarce at present, and sensitivity varied substantially. Study heterogeneity was low for eight of 14 sensitivity and 68 of 84 specificity results with confidence interval width ≤ 5%, and lower for nucleic acid tests than antibody tests. Manufacturer-reported clinical performance was significantly higher than independently assessed performance in 11 of 32 and four of 34 cases, respectively, for sensitivity and specificity, indicating a need for improvement in this area. Conclusion Continuous monitoring of clinical performance within more clearly defined target populations is needed.


Introduction
Testing is one of the central pillars of public health actions in epidemic and pandemic situations to allow timely identification, contact tracing and isolation of infectious cases to reduce the spread of infectious diseases. In addition, it allows estimating disease incidence, disease prevalence, and prevalence and duration of humoral immunity. Reliable testing for severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and timely reporting of the data to public health authorities are therefore key for the management of the coronavirus disease (COVID-19) pandemic. This requires appropriate and sufficiently accurate diagnostic tests to identify individuals who are currently infected with SARS-CoV-2 as well as those who have been infected in the past. Timely access to testing, sufficient supply of testing materials, availability of tests and related reagents and consumables as well as high-throughput testing are pivotal in this context. By August 2020, a large number of commercial tests for SARS-CoV-2 RNA detection (nucleic acid tests) were available, as well as serological tests for SARS-CoV-2-specific antibodies. The various types of tests can be used for different purposes and many of these tests have the CE certificate for in vitro diagnostics (CE-IVD) that indicates compliance with the European IVD directive (98/79/EC) and can thus be marketed in the countries in the European Union and European Economic Area (EU/EEA). In addition, the United States (US) Food and Drug Administration has granted emergency use authorisations for many commercial tests in the US, and the World Health Organization (WHO) maintains an emergency use listing of commercial tests [1,2]. It is, however, important to note that CE certification is based on a self-declaration of the test manufacturer, including the claims on performance of the test.
Independent information on the clinical performance of these tests in terms of sensitivity and specificity is still limited, and yet this is critical for proper interpretation of results.
For this reason, the European Centre for Disease Prevention and Control (ECDC) launched a continuous call to EU/EEA countries and the United Kingdom (UK) on 1 April 2020 to provide any such clinical performance data for sharing with other countries. These data, provided by 12 countries, are presented in this article. In addition, we included publicly available data. Finally, minimal performance criteria for different intended uses were gathered from public sources and aided by a survey conducted among EU/EEA countries and the UK from 20 May to 1 June 2020.

Search strategy and selection criteria
Studies containing potentially usable data on the clinical performance of SARS-CoV-2 nucleic acid and antibody tests were first extracted from systematic reviews on this topic. We identified these reviews through an initial PubMed (Medline) search for systematic reviews and meta-analyses for 'COVID-19' and 'SARS-CoV-2', followed by snowballing using the 'find similar articles' feature. We extended the selection with the studies listed in the Foundation for Innovative Diagnostics database (FIND, www.finddx.org/covid-19/tests) and the European Commission COVID-19 In Vitro Diagnostic Devices and Test Methods Database (EC, https://covid-19-diagnostics.jrc.ec.europa.eu).
Both databases attempt to exhaustively identify peer-reviewed as well as grey literature on clinical performance of COVID-19 tests and are continuously updated [3,4]. Results from the latter were further filtered for articles with a description indicating that they contain clinical performance results. We also included results produced by the US Food and Drug Administration (FDA) [5]. Finally, we searched PubMed according to the query shown in Supplement 1.
The resulting studies were subsequently assessed for eligibility. By August 2020, there were no clinical performance studies that could be judged as having low risk of bias and low applicability concerns. Systematic reviews up to that point had not used risk of bias or applicability concerns as exclusion criteria [6-9]. This was not done in this work either. Instead, we excluded studies if they did not contain data on commercial tests, or if one or more of the authors were employed by the developer or manufacturer of the index test, to avoid possible conflicts of interest. Subsequently, we also excluded studies with an ineligible design, such as blinded tests, analytical validation only, use of another threshold for positivity than in the instructions for use, comparisons between different specimen types or use of an antibody rather than nucleic acid test as reference test for any type of index test.
Further exclusions were made at sample level based on the reference test employed. Samples classified as actual negatives, i.e. used for determining specificity, had to be taken (i) before the COVID-19 outbreak, in practice before 2020, (ii) from an individual without COVID-19-compatible symptoms, or (iii) from an individual with COVID-19-compatible symptoms but who was confirmed with another respiratory illness. Samples classified as actual negatives that were taken during the outbreak and were negative according to a nucleic acid test were therefore excluded. We did this to maximally reduce misclassification as actual negatives because of known issues with sensitivity of nucleic acid tests. Such misclassified samples would artificially lower index test specificity, in particular when the index test is more sensitive than the reference test [10-16]. For the same reason, the reported sensitivity of nucleic acid index tests, based on a nucleic acid reference test, was considered to be a positive agreement instead, calculated as part of a head-to-head comparison between the two tests. For antibody index tests on the other hand, we considered a nucleic acid test to be a valid reference test to determine actual positive samples and sensitivity, in accordance with WHO interim guidelines [17].
Manufacturer-reported clinical sensitivity and specificity data were extracted from instructions for use where available, or otherwise from the manufacturer's website. Sensitivity results derived from contrived samples spiked with purified viral RNA were excluded.

Original clinical performance data
Primary clinical performance data generated by the COVID-19 microbiological laboratories author group were assessed by the ECDC according to the same criteria as those of the literature review.

Statistical analysis
Meta-analysis of the included clinical sensitivity and specificity results was performed per test and per target, i.e. the genomic region for nucleic acid tests and the antibody isotype for antibody tests. Antibody test sensitivity results below the threshold number of days after onset were excluded. Sensitivity and positive agreement results were further stratified by case population as hospitalised cases, mild or asymptomatic cases, or unknown. We calculated pooled sensitivity and specificity values using fixed effects analysis, i.e. separately summing and dividing the number of correct predictions by the total number of samples in the group. Wilson score 95% confidence intervals (CI) were calculated for pooled results. Study heterogeneity was assessed through the I² statistic, calculated through random effects analysis using R version 4.0.2 and the metafor package [18]. We considered I² values < 50.0% as low heterogeneity, 50.0-74.9% as moderate and ≥ 75.0% as high heterogeneity.
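The pooling and interval calculation described above can be sketched in a few lines. This is a minimal illustration in Python, not the code used in the study (which relied on R and the metafor package). The function names and the study counts are hypothetical, and the I² value itself is assumed to come from a separate random effects analysis; only its classification into low, moderate and high is shown.

```python
from math import sqrt


def pool_counts(studies):
    """Pool raw counts: sum correct results and sample totals over studies.

    `studies` is a list of (correct, total) pairs, one pair per study.
    """
    correct = sum(c for c, _ in studies)
    total = sum(n for _, n in studies)
    return correct, total


def wilson_ci(correct, total, z=1.959964):
    """Wilson score interval for a binomial proportion (default z gives 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return centre - half, centre + half


def heterogeneity_label(i2_percent):
    """Classify an I² value (in %) using the cut-offs stated in the text."""
    if i2_percent < 50.0:
        return "low"
    if i2_percent < 75.0:
        return "moderate"
    return "high"


# Hypothetical example: three studies of one test in one case population.
studies = [(95, 100), (188, 200), (47, 50)]
correct, total = pool_counts(studies)
lower, upper = wilson_ci(correct, total)
print(f"pooled sensitivity: {correct / total:.1%}")
print(f"95% CI: {lower:.1%} to {upper:.1%} (width {upper - lower:.1%})")
```

Pooling raw counts weights each study by its sample size, so large studies dominate the pooled estimate. The CI width printed on the last line is the quantity compared against the ≤ 5% cut-off used in the analysis.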

Minimum performance criteria
By 1 June 2020, minimum performance criteria for tests were publicly available from Belgium, France, the Netherlands and the UK (Supplementary Table S1). All were applicable solely to antibody tests. The intended uses included diagnosis of COVID-19, determination of exposure to SARS-CoV-2 and determination of the immune status against SARS-CoV-2. Minimum clinical sensitivity for all of the specified intended uses ranged from 85% to 98%, with a median of 95%. These thresholds applied to samples collected at least 15 days post onset of symptoms (dpo), taking into account the time to seroconversion. Minimum clinical specificity for all of the specified intended uses was 98% in three countries and 98.5% in one. For nucleic acid confirmatory tests, the draft WHO Target Product Profiles for priority diagnostics to support response to the COVID-19 pandemic state > 95% to > 98% sensitivity (acceptable/desired) and > 99% specificity [19].
We used general thresholds of > 95% sensitivity and > 98% specificity to determine if a test met the minimum performance criteria, together with a maximum 95% CI width ≤ 5%. For results on IgM antibodies only, an upper limit of ≤ 28 dpo, or the highest dpo category with an upper limit ≤ 28 dpo, was added since IgM antibodies decrease fairly rapidly and such tests are not intended to be used long after exposure [20]. These sensitivity and specificity thresholds can be converted to false positives (FP) and false negatives (FN), and positive and negative predictive value (PPV, NPV) if the prevalence of the condition, i.e. SARS-CoV-2 nucleic acid or antibody positivity, is known. These metrics better express the real impact of the accuracy. For a hypothetical low prevalence of 1% in a population of 100,000 people, the PPV would be > 32.4% (FP < 1,980) and NPV > 99.9% (FN < 50). For a medium prevalence of 5%, these values would be > 71.4% (FP < 1,900) and > 99.7% (FN < 250). Finally, for a high prevalence of 30%, PPV would be > 95.3% (FP < 1,400) and NPV > 97.9% (FN < 1,500).
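The conversion from sensitivity, specificity and prevalence to FP, FN, PPV and NPV is simple arithmetic on a 2 × 2 table. The following Python sketch evaluates it at the threshold values of 95% sensitivity and 98% specificity for the three prevalence scenarios in the text. The function name `predictive_values` is illustrative and not part of the study.

```python
def predictive_values(prevalence, sensitivity, specificity, population=100_000):
    """Expected 2x2 table counts and predictive values for a test.

    All rates are fractions between 0 and 1; counts are expected values.
    """
    positives = population * prevalence      # people with the condition
    negatives = population - positives       # people without the condition
    tp = positives * sensitivity             # true positives
    fn = positives - tp                      # false negatives
    tn = negatives * specificity             # true negatives
    fp = negatives - tn                      # false positives
    return {
        "FP": fp,
        "FN": fn,
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
    }


# Threshold values from the text: 95% sensitivity, 98% specificity.
for prevalence in (0.01, 0.05, 0.30):
    r = predictive_values(prevalence, sensitivity=0.95, specificity=0.98)
    print(
        f"prevalence {prevalence:.0%}: "
        f"FP {r['FP']:,.0f}, FN {r['FN']:,.0f}, "
        f"PPV {r['PPV']:.1%}, NPV {r['NPV']:.1%}"
    )
```

Because the values are computed exactly at the thresholds, a test that exceeds both thresholds would have fewer FP and FN and higher PPV and NPV than the printed figures, which is why the text states them as bounds.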

Primary clinical performance data
We identified eight systematic reviews, including one by health technology assessment bodies not listed as a peer-reviewed study, and included the primary studies they were based on [6-9,21-24]. The full list of studies in the FIND and EC databases was retrieved on 22 August 2020. PubMed was searched on the same date. From the EC database, 268 of 385 studies were screened out because their description did not indicate that they contained clinical performance data on commercial tests. Of the remaining 117 studies, 81 were not present in the FIND database and 82 were not present in the EC database. From the PubMed results, 1,520 of 1,738 studies were screened out. From the combined list of 364 unique studies, 105 had no clinical performance data on commercial nucleic acid or antibody tests, 34 were excluded because of a potential conflict of interest and 74 were excluded because of ineligible design, leaving a total of 151 included studies. Of those, 53 were exclusively found through the PubMed search and 15 in the FIND database. The remaining studies were listed by at least two sources.
A complete overview of the study selection is given in Figure 1. After exclusion of antibody test sensitivity results ≤ 14 dpo and ineligible specificity results, a total of 37,435 and 56,322 index test results remained for calculation of sensitivity and specificity, respectively.
After addition of original, previously unpublished results provided by the authors of this study, this increased to 47,543 and 66,419 index test results, respectively, for 198 tests. A descriptive overview of the number of studies and results per country is given in Table 1. A complete overview of the studies is given in Supplementary Tables S2-S4.

Meta-analysis
Pooled estimates for clinical sensitivity and specificity were made per test, target and, for sensitivity, case population. For antibody tests, we restricted the results to those estimates that had a 95% CI width ≤ 5% and were derived from at least two studies, to be able to assess study heterogeneity. Based on the minimum performance criteria analysis, results ≥ 95% sensitivity and/or ≥ 98% specificity for a particular population are highlighted in Table 2. Among these results, there were two chemiluminescence immunoassays (CLIA), one enzyme-linked immunosorbent assay (ELISA) and no lateral flow immunoassays or point-of-care tests (LFIA/POC) that had ≥ 95% sensitivity, and nine CLIA, four ELISAs and 12 LFIA/POC that had ≥ 98% specificity, including the three with ≥ 95% sensitivity. Study heterogeneity was low for four of 10 sensitivity and 53 of 69 specificity results with CI width ≤ 5%. There were few sensitivity results for IgG for mild or asymptomatic cases, for IgA and for total antibody, none of which had a CI width ≤ 5%. In four cases where the same IgG test was also used for hospitalised cases, sensitivity for mild or asymptomatic cases was lower by 7.4%, 11.0%, 13.1% and 19.2% (Table 2). For IgA and total antibody, data were available for only one test each. A reduction of 28.8% was observed for IgA and an increase of 6.0% for total antibody. The latter increase was probably due to the small number of samples for both populations.
For nucleic acid tests, results were restricted as for antibody tests (Table 3). Four tests, including one POC, had ≥ 95% positive agreement with a CI width ≤ 5%, and nine had ≥ 98% specificity. Study heterogeneity was low for all five sensitivity and all 15 specificity results with CI width ≤ 5%.
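The restriction and highlighting rules used for both test types can be expressed as two small predicates. The sketch below is in Python and assumes a pooled estimate with its 95% CI has already been computed. The class `PooledResult`, the function names and the example values are hypothetical and serve only to illustrate the rules stated in the text.

```python
from dataclasses import dataclass


@dataclass
class PooledResult:
    """A pooled sensitivity or specificity estimate (fractions between 0 and 1)."""
    estimate: float
    ci_lower: float
    ci_upper: float
    n_studies: int


def is_reportable(result, max_ci_width=0.05, min_studies=2):
    """Restriction from the text: 95% CI width <= 5% and at least two studies."""
    width = result.ci_upper - result.ci_lower
    return width <= max_ci_width and result.n_studies >= min_studies


def meets_threshold(result, threshold):
    """Highlight a reportable result at or above the given performance threshold."""
    return is_reportable(result) and result.estimate >= threshold


# Hypothetical pooled sensitivity result for one test and case population.
sensitivity = PooledResult(estimate=0.962, ci_lower=0.945, ci_upper=0.975, n_studies=3)
print(is_reportable(sensitivity))            # CI width is 3.0%, three studies
print(meets_threshold(sensitivity, 0.95))    # compared with the 95% sensitivity threshold
```

Requiring at least two studies is what makes a heterogeneity assessment possible, since I² cannot be estimated from a single study.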
The correlation between independently assessed clinical performance results and manufacturer-reported results is shown in Figure 2. The manufacturer-reported documents are listed in Supplementary Table S2. Only independently assessed results with CI width ≤ 5% are included. A total of 11 of 32 sensitivity and four of 33 specificity results reported by the manufacturer were significantly higher than the independently assessed results (p < 0.05).

Discussion
This review presents a comprehensive independent overview of clinical performance of commercially available nucleic acid and antibody tests 5 months into the COVID-19 pandemic. A substantial amount of previously unpublished data from European countries are included as well. By August 2020, there were numerous commercial tests for which sufficient performance data were available to allow calculation of clinical sensitivity or positive agreement, and specificity with narrow confidence interval ranges. It is reassuring that the clinical performance of several nucleic acid and antibody tests exceeded the minimum performance criteria. As time progresses, the list of tests with sufficient available performance data is expected to grow.
At the same time, the available evidence for point-of-care nucleic acid and antigen tests remains scarce, even though these tests can have substantial practical advantages for e.g. screening. We therefore recommend more emphasis on the validation of these tests, including as part of a testing algorithm, whereby the sensitivity and specificity of taking two tests with a number of days in between is assessed, and which can for example be useful to reduce the duration of a quarantine period.
The comparison between the independently assessed clinical performance data and manufacturer-reported clinical performance revealed that in particular sensitivity is frequently (34.4% of the cases in this study) significantly overestimated by the manufacturer. At a minimum, this emphasises that such independent assessments are clearly necessary. In the longer term, an explicit and proactive regulatory mechanism in Europe to compare available independently generated evidence on these tests against the manufacturer-reported values, coupled with appropriate regulatory action, would be useful. This could also be rewarding towards those manufacturers that do provide robust estimates of their product's performance. The new in vitro diagnostic medical devices Regulation (EU) 2017/746 (IVDR), which will enter into force in May 2022, will impose more stringent requirements on clinical performance studies done by manufacturers. In addition, the IVDR will also regulate the use of lab-developed tests such as the in-house PCR tests developed for COVID-19 [25]. Because of the COVID-19 pandemic, the European Commission has recently proposed to modify the roll-out [26].
Limitations of our article include that most of the included studies had a substantial risk of bias in the sample selection, especially for the sensitivity panel, as established also in the assessments performed in the systematic reviews that we used as a source. Results were mainly based on hospitalised cases or poorly defined populations, whereas the population of interest often consists of symptomatic cases in general, or even asymptomatic cases, and differences in performance may exist depending on disease severity. Performance also varies depending on the type of specimen used, and our study design allowed for the inclusion of multiple specimen types in accordance with the instructions for use. This reflected to some extent clinical practice, but is also a contributing factor to study heterogeneity that we did not address here. Similarly, the pre-analytical steps such as RNA extraction can have a substantial effect on performance. These are often not specified in detail or several processes may be allowed according to the instructions for use, which can have contributed to study heterogeneity. While this review addresses a pressing need for actionable clinical performance data, ideally, the clinical performance should be assessed through prospective studies or clinical trials with a guaranteed unbiased sample selection for a clearly defined target population and intended use of the test. Given the difficulty of assessing and extracting the data from individual studies in a coherent way, we recommend that the Standard for Reporting of Diagnostic Accuracy Studies (STARD) should also be followed when publishing the results [27].

Table 3b
Pooled positive agreement and specificity results for SARS-CoV-2 nucleic acid tests with confidence interval width ≤ 5% for either or both and based on at least two studies, up to 22 August 2020

In this context, the selection of the reference test is particularly important with respect to reference negative samples. As described in some of the assessed studies, it should be avoided that index test results are considered as false positives while the samples are from actual cases; for this reason we excluded nucleic acid-negative samples from suspected COVID-19 patients altogether. We therefore expect little bias in the specificity results, except potentially from under- or overrepresentation of confounders. This is especially relevant for seroprevalence studies where, in a low-prevalence situation, in particular the specificity of the test needs to be well defined and high. On the other hand, sensitivity results using a nucleic acid test as reference should be interpreted with caution because the positive samples may exclude some actual cases.
Possibilities to improve the reference test can include testing (potentially only the false positives) with a second reference nucleic acid test preferably targeting different genes, testing more than one sample from the same patient including for antibodies at a later time point, testing samples from both upper and lower respiratory tracts, and sequencing the sample. The handling of intermediate index test results is an issue that needs to be described in studies. In general, these should be considered as positive results rather than being counted as negatives or excluded from the validation, since in clinical practice they would normally require further follow-up to confirm the positivity of the sample.
Finally, the quality of the execution of the tests is also an important factor. For non-point-of-care tests, external quality assessment exercises using well validated standard reference materials remain a critical tool to detect and address such issues.

Conclusion
Given the study limitations, the authors and organisations contributing to this study in no way recommend the use of the listed commercial tests over other not listed commercial or in-house tests. The clinical performance of tests may also change over time as the virus population evolves. We recommend, however, continuous monitoring of clinical performance both in Europe and globally, which is key for reliable monitoring of the pandemic and which will also support vaccine and antiviral development. These results should be shared publicly in a timely manner.

Figure 2
Independently assessed vs manufacturer-reported clinical sensitivity and specificity per SARS-CoV-2 test, up to 22 August 2020 (n = 55)