Comparative sensitivity evaluation for 122 CE-marked rapid diagnostic tests for SARS-CoV-2 antigen, Germany, September 2020 to April 2021

Introduction Numerous CE-marked SARS-CoV-2 antigen rapid diagnostic tests (Ag RDT) are offered in Europe, several of them with unconfirmed quality claims. Aim We performed an independent head-to-head evaluation of the sensitivity of SARS-CoV-2 Ag RDT offered in Germany. Methods We assessed the sensitivity of 122 Ag RDT in direct comparison using a common evaluation panel comprising 50 specimens. Minimum sensitivity of 75% for panel specimens with a PCR quantification cycle (Cq) ≤ 25 was used to identify Ag RDT eligible for reimbursement in the German healthcare system. Results The sensitivity of different SARS-CoV-2 Ag RDT varied over a wide range. The sensitivity limit of 75% for panel members with Cq ≤ 25 was met by 96 of the 122 tests evaluated; 26 tests exhibited lower sensitivity, a few of which failed completely. Some RDT exhibited high sensitivity, e.g. 97.5% for Cq < 30. Conclusions This comparative evaluation succeeded in distinguishing less sensitive from better performing Ag RDT. Most of the evaluated Ag RDT appeared to be suitable for fast identification of acute infections associated with high viral loads. Market access of SARS-CoV-2 Ag RDT should be based on minimal requirements for sensitivity and specificity.


Introduction
A large number of antigen-detecting rapid diagnostic tests (Ag RDT) for severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) are available on the European market, both for professional use and as self-tests. Rapid tests are based on lateral flow immunochromatography using antibodies against SARS-CoV-2 proteins (antigens) present in respiratory tract specimens. The large majority of Ag RDT target the viral nucleoprotein; only very few assays detect the spike protein. Viral variants of concern (VOC) mainly contain mutations in the gene encoding the spike protein, leaving the vast majority of SARS-CoV-2 Ag RDT unaffected; however, the few SARS-CoV-2 Ag RDT based on spike protein detection should be checked at regular intervals for potential deficiencies. While PCR is still the gold standard for virus detection, there is increasing evidence that infectivity of respiratory secretions correlates with high viral loads present in the early phase of infection, e.g. before and 0-10 days after onset of symptoms. In addition to more complex and time-consuming PCR systems, Ag RDT allow rapid identification of acutely infected and potentially infectious individuals, facilitating fast decisions on containment of virus spread, patient care, isolation and contact tracing [1,2]. Furthermore, Ag RDT may save limited reagents of more sensitive molecular diagnostics to serve other diagnostic needs, e.g. disease management or confirmation of Ag RDT reactive results.
In the European Union (EU), regulatory requirements for SARS-CoV-2 in vitro diagnostic medical devices (IVD) are defined by the IVD Directive 98/79/EC (IVDD) and have to be addressed by the manufacturer before access to the EU Common Market [3]. However, certification (CE marking) of SARS-CoV-2 diagnostics is currently done solely by the manufacturer (self-certification), without third party intervention. The exception is SARS-CoV-2 self-tests, where a notified body has to assess studies with lay persons performing the tests. However, owing to the urgency of the coronavirus disease (COVID-19) situation, a national derogation for CE-certification of self-tests can be agreed by the national competent authority, e.g. by relying on the performance of the same RDT cassette offered for professional use. Starting from May 2022, the IVDD will be replaced by the IVD Regulation (EU) 2017/746 (IVDR), under which a risk-based classification of IVD determines the level of scrutiny applied in their assessment [4,5]. The SARS-CoV-2 IVD will belong to the high-risk devices (class D) under the IVDR, requiring a notified body both for certification of the manufacturer's quality management system and for assessment of the technical documentation of the device. Furthermore, EU reference laboratories (EURL) will be responsible for independent laboratory testing of class D devices to verify performance features and to assure batch-to-batch consistency [4]. However, for the time being, independent evaluations of SARS-CoV-2 Ag RDT that allow conclusions on their performance are largely missing.
In the current situation, with strict regulatory requirements absent for most SARS-CoV-2 IVD, the German Ministry of Health decided to link the reimbursement of SARS-CoV-2 Ag RDT to provision of evidence of essential quality features of these assays. This evidence consisted of two parts: (i) compliance with minimum criteria for RDT sensitivity (detection of > 80% of PCR-positive symptomatic patients during the first 7 days after symptom onset) and specificity (> 97% for asymptomatic persons) in studies performed by or on behalf of the manufacturer with clinical specimens, and (ii) successful outcome in the independent laboratory evaluation. Minimum criteria were jointly defined by the Paul-Ehrlich Institute (PEI) and the Robert Koch Institute (RKI), two governmental authorities in Germany [6]. Manufacturers or distributors must document compliance of the respective SARS-CoV-2 Ag RDT with these criteria before the device can be listed as eligible for reimbursement on a dedicated webpage of the Bundesinstitut für Arzneimittel und Medizinprodukte (BfArM), another governmental authority [7].
We selected devices from the BfArM list for the comparative evaluation performed by PEI/RKI. The aim of this comparative evaluation was both to determine the state-of-the-art sensitivity of proficient devices and to identify devices not reaching the minimum sensitivity level. The concept of 'state of the art' is also mentioned in the IVD Regulation (EU) 2017/746 (IVDR) [5], describing a defined level of quality features achieved by the majority of assays at a certain time point after their comparative evaluation using a uniform sample set (head-to-head comparison); with continuous improvement of devices, the state-of-the-art level increases over time and would therefore need to be reassessed at certain intervals. Subsequently, devices with sensitivity below the state of the art are removed from the BfArM list, while all devices with a successful evaluation outcome are published on the PEI webpage [8]. We evaluated 122 SARS-CoV-2 Ag RDT in direct comparison using a common panel of SARS-CoV-2 specimens.

Evaluation panel
Detailed characterisation of the evaluation panel has been described by Puyskens et al., published in this issue of Eurosurveillance [9]. In short, pools from nasopharyngeal and oropharyngeal swabs from SARS-CoV-2-positive individuals were prepared as random mixtures obtained from up to 10 swabs. While dry swabs were directly eluted in phosphate-buffered saline (PBS), the residual amount of virus transport media (VTM) contained in moist swabs was diluted in PBS. Care was taken not to use VTM containing the protein-denaturing component guanidinium.
Individual pools were composed of samples with similar SARS-CoV-2 concentrations, expressed as quantification cycle (Cq) values of semiquantitative PCR. In total, 50 different pools were defined as members of the evaluation panel and stored as 500 µL aliquots at −80 °C. The Cq of each panel member was determined by PCR, and the putative number of RNA copies was calculated with the aid of the reference preparations distributed by the German external quality assessment (EQA) provider INSTAND e.V. [10]. Furthermore, presence of infectious virus detectable by successful propagation in Vero cell culture was investigated for the individual pools. Results were largely in line with published findings that in vitro infectivity corresponds with virus concentrations of Cq ≤ 25 [11-13]; this was largely confirmed in our study with nine of 17 and three of 18 members of the two panel versions 1V1 and 1V2, respectively (see Supplement: Design and manufacture of the evaluation panel). However, there is no established Cq cut-off value at which individuals are estimated to be no longer infectious.

Antigen stability
Real-time antigen stability in panel members was investigated at the PEI using the quantitative SARS-CoV-2 ELISA Lumipulse G SARS-CoV-2 Ag (Fujirebio Inc., Shinjuku-ku, Tokyo, Japan). Panel members were tested after initial thawing and 1 week incubation at 4 °C. Furthermore, potential impact of an additional freeze/thaw cycle was addressed.

Table 1C
Comparative evaluation results of SARS-CoV-2 antigen rapid diagnostic tests passing the sensitivity criteria (in alphabetical order of manufacturers), Germany, 2020-2021 (n = 96)

Comparative evaluation
At the beginning of the comparative evaluation, participating laboratories included those at the RKI, the PEI, the Nationales Konsiliarlaboratorium für Coronaviren (Institute of Virology, Charité), the Bundeswehr Institute of Microbiology, the Bernhard-Nocht-Institut für Tropenmedizin and laboratories of the association Akkreditierte Labors in der Medizin (ALM). At a later stage, because of the increasing workload, the evaluation was continued by PEI and RKI. Panels were shipped on dry ice and, once thawed, 50 µL aliquots were prepared, kept at 4 °C and used within 5 days, without a further freeze/thaw step. For each Ag RDT and panel member, the 50 µL aliquot was completely absorbed using the specimen collection device, e.g. a swab, provided with the respective test. The swabs were then eluted in the test-specific buffer, strictly following the respective instructions for use (IFU

Characterisation of the evaluation panel
Panel members spanned the Cq range between 17 and 36. A specimen with an assigned SARS-CoV-2 RNA concentration of 10⁶ RNA copies/mL provided by INSTAND corresponded to the Cq value of 25. Assuming that a Cq difference of 1 corresponds to a concentration factor of 2, and taking into account that the individual panel members covered a Cq range from 17 to 36, the SARS-CoV-2 RNA amounts in the panel covered a concentration range from > 10⁸ to < 10³ copies per mL, respectively. The Cq values of 20 or 30 therefore corresponded to approximate SARS-CoV-2 RNA concentrations of 3 × 10⁷ or 3 × 10⁴ copies per mL, respectively. SARS-CoV-2 propagation in cell culture resulted in positive results for several of the low Cq/high-titre specimens, indicating presence of infectious virus despite the various preparation steps (more details in [9] and in the Supplement: Design and manufacture of the evaluation panel).
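The conversion between Cq and approximate RNA concentration described above can be reproduced with a short calculation. The following is a minimal sketch under the stated assumptions only (Cq 25 anchored at 10⁶ copies/mL, a factor of 2 per Cq step); the function and constant names are illustrative and not part of the study.

```python
# Anchor taken from the text: Cq 25 corresponds to 1e6 RNA copies/mL
# (INSTAND reference preparation).
ANCHOR_CQ = 25.0
ANCHOR_COPIES_PER_ML = 1e6
# Assumption stated in the text: a Cq difference of 1 equals a factor of 2.
FOLD_PER_CYCLE = 2.0


def copies_per_ml(cq: float) -> float:
    """Approximate RNA copies/mL for a given Cq under the factor-2 assumption."""
    return ANCHOR_COPIES_PER_ML * FOLD_PER_CYCLE ** (ANCHOR_CQ - cq)


if __name__ == "__main__":
    # Boundaries of the panel (Cq 17 and 36) and the values quoted in the text.
    for cq in (17, 20, 25, 30, 36):
        print(f"Cq {cq}: {copies_per_ml(cq):.1e} copies/mL")
```

Under these assumptions, Cq 17 and Cq 36 map to roughly 2.6 × 10⁸ and 4.9 × 10² copies/mL, consistent with the range of > 10⁸ to < 10³ given above. Real PCR assays rarely amplify with perfect efficiency, so the values are order-of-magnitude estimates.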
Investigation of stability of the analyte SARS-CoV-2 antigen in panel members revealed a negative effect for additional freeze/thaw steps; in contrast, there was no obvious impact on the antigen content after 7 days experimental storage at 4 °C of the liquid 50 µL aliquots (data not shown). From one 500 µL thawed aliquot, routinely nine to 10 aliquots of 50 µL were immediately filled and used within 5 days for evaluation of nine to 10 RDT, respectively, ensuring that there were no stability issues.

Comparative evaluation
Table 3 lists the subgroups based on performance data.
The 96 tests meeting the sensitivity criteria were reactive with between 14 and 41 members of the 50-member panel (see Supplementary Figure S1). On average across all successful tests, 27 panel members (54%) were reactive. Overall reactivity of SARS-CoV-2 Ag RDT strongly followed the analyte concentration throughout the panel, confirming the design of this study (see Supplementary Figures S1 and S2).
The 26 SARS-CoV-2 Ag RDT not meeting the sensitivity criteria either failed completely (two tests with 0 reactives) or were reactive with two to 12 (average: 6.3) panel members. Again, reactivity was dependent on the analyte concentration throughout the panel members (see Supplementary Figure S2). Two further tests failed because of constant faint background reactivity throughout all panel members; this background reactivity was also seen when using pure extraction buffer and was thus not caused by the panel composition (data not shown). According to information provided by the RDT manufacturers, nucleoprotein is used as target antigen in 112 assays, spike protein in three (Table 1: RDT no. 78; Table 2: RDT no. 107 and 109) and both nucleoprotein and spike protein in two (Table 1: RDT no. 43 and 64). For five assays, information on the target antigen was not available. Although two of the five assays detecting spike protein failed in this evaluation, the number is too small to draw conclusions about a potential association between the chosen target antigen and RDT performance.
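The pass/fail decision used in this evaluation (at least 75% reactive results among panel members with Cq ≤ 25) can be sketched as follows. This is an illustrative sketch, not the evaluation software: the class and function names are hypothetical, and treating exactly 75% as a pass is an assumption based on the wording 'minimum sensitivity of 75%'.

```python
from dataclasses import dataclass

CQ_LIMIT = 25.0         # criterion applies to panel members with Cq <= 25
MIN_SENSITIVITY = 0.75  # assumed inclusive: exactly 75% counts as a pass


@dataclass
class PanelResult:
    cq: float       # Cq assigned to the panel member
    reactive: bool  # visual read-out of the RDT for this member


def sensitivity(results, cq_limit=None):
    """Fraction of reactive results, optionally restricted to Cq <= cq_limit."""
    subset = [r for r in results if cq_limit is None or r.cq <= cq_limit]
    if not subset:
        raise ValueError("no panel members in the requested Cq range")
    return sum(r.reactive for r in subset) / len(subset)


def passes_criterion(results):
    """True if sensitivity for Cq <= 25 reaches the 75% minimum."""
    return sensitivity(results, CQ_LIMIT) >= MIN_SENSITIVITY
```

Calling `sensitivity(results)` without a limit gives the detection rate over the complete panel, which corresponds to the count of reactive panel members reported per test.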

Discussion
There is convincing evidence that infectivity of SARS-CoV-2 correlates directly with high viral loads in respiratory specimens of acutely infected persons [11-13]. It has therefore been suggested in many countries to use antigen tests to detect potential infectivity and help control the spread of infection rather than for the purpose of clinical diagnosis. Thus Ag RDT have become a key part of testing strategies since the autumn of 2020. Hundreds of different Ag RDT, most [...] Lack of independent evaluation combined with unjustified statements of quality features led the German Ministry of Health to request in autumn 2020 a comparative evaluation of the sensitivity of test kits offered in Germany. At the time we performed our study (autumn 2020 to spring 2021), there were no EU-wide requirements for quality features of COVID-19 IVD such as a defined minimum sensitivity or minimum specificity, and manufacturers could certify their devices themselves as compliant with the basic requirements of the IVDD. Therefore, it was mainly left to individual countries or international organisations to define minimum requirements for the acceptance of tests. However, in autumn 2021, a Guidance on performance evaluation of SARS-CoV-2 in vitro diagnostic medical devices was endorsed by the EU Medical Device Coordination Group, which will be the basis for future Common Specifications of the IVD Regulation (EU) 2017/746 [14].
In Germany, the Ministry of Health decided to link the reimbursement of SARS-CoV-2 Ag RDT to quality requirements that needed to be fulfilled by acceptable devices. Minimum requirements were jointly formulated by the PEI and RKI and state for SARS-CoV-2 Ag RDT a minimum sensitivity of 80% for PCR-positive specimens obtained within the first 7 days after symptom onset; the minimum specificity was defined as > 97%, and for both requirements, a study population of at least 100 persons is required [6]. Analogous requirements for SARS-CoV-2 Ag RDT have been proposed by the World Health Organization (WHO) for the emergency use listing [15], the United States Food and Drug Administration [16], the European Centre for Disease Prevention and Control [17], the Swiss Authority Bundesamt für Gesundheit [18] and the non-governmental Foundation for Innovative New Diagnostics [19]. Furthermore, RDT reimbursable in Germany had to pass our comparative evaluation, the first part of which is summarised in this manuscript; the evaluation has been continued for further RDT with an equivalent panel version 3 (data not included in this manuscript).
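The minimum requirements for manufacturer studies described above can be expressed as a simple check. The sketch below is hypothetical and rests on two labelled assumptions: the thresholds are applied as strict inequalities (> 80% and > 97%), and the study population of at least 100 persons is applied separately to the PCR-positive and the PCR-negative group. All names are illustrative.

```python
from dataclasses import dataclass

MIN_SENSITIVITY = 0.80  # PCR-positive, within 7 days after symptom onset
MIN_SPECIFICITY = 0.97
MIN_STUDY_SIZE = 100    # assumed to apply to each group separately


@dataclass
class ManufacturerStudy:
    true_positives: int  # RDT-reactive among PCR-positive participants
    pcr_positives: int   # PCR-positive participants
    true_negatives: int  # RDT-negative among PCR-negative participants
    pcr_negatives: int   # PCR-negative participants


def meets_minimum_criteria(study: ManufacturerStudy) -> bool:
    """Check study size first, then sensitivity and specificity thresholds."""
    if study.pcr_positives < MIN_STUDY_SIZE or study.pcr_negatives < MIN_STUDY_SIZE:
        return False
    sens = study.true_positives / study.pcr_positives
    spec = study.true_negatives / study.pcr_negatives
    return sens > MIN_SENSITIVITY and spec > MIN_SPECIFICITY
```

A device satisfying this check would still need to pass the independent laboratory evaluation to remain listed as reimbursable.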
The definition of a 75% minimal detection rate (analytical sensitivity) for panel members with Cq ≤ 25 in our comparative evaluation was based on several reasons. Firstly, infectious virus determined by cell culture propagation was reported for specimens with virus concentrations corresponding to an RNA level of around 10⁶ copies/mL and higher [11-13] (Supplement). Secondly, early in the evaluation, this limit proved to differentiate between RDT with different levels of analytical sensitivity, largely in accordance with diagnostic sensitivity determined in independent SARS-CoV-2 RDT evaluations using clinical specimens [19,20]. Thirdly, there is intrinsic variation between different nucleic acid amplification tests with regard to reported Cq values because of assay-specific nucleic acid extraction/elution volumes combined with assay-specific amplification input volume and amplification efficiency. This fact explains the urgent need for standardisation in this field using a common reference preparation, e.g. the WHO International Standard (IS), in combination with common unitage reporting, e.g. international units associated with the WHO IS [21]. Finally, Cq values of our panel members are not directly comparable to those of clinical specimens in other studies: we quantified the panel members by pipetting an aliquot into the amplification reaction, while viral RNA in clinical specimens is measured after its elution from swabs, with probable swab-dependent retention of viral compounds, as described in Puyskens et al. [9].
We recognise as a potential limitation that clinical specimens are defined according to the time point of symptom onset and may not necessarily reflect the same viral load pattern as in our panel. We followed routine use of the tests as far as possible, including pre-analytical steps such as antigen absorption using the test-specific swabs, and subsequent elution into the test-specific buffer. Although this procedure does not follow the IFU exactly, we estimate that it is very close to the routine steps prescribed in the IFU of each test for processing clinical specimens. The vast majority (79%) of Ag RDT included in our study showed sufficient sensitivity according to our criteria. Nevertheless, the results showed a wide range of sensitivity. A few tests showed high and many tests sufficient sensitivity, but a considerable share (21%) did not meet the minimum criterion. Our study shows that the majority of SARS-CoV-2 Ag RDT correctly identify high viral loads of Cq ≤ 25 (> 10⁶ virus RNA copies/mL) in samples from the respiratory tract with a sensitivity of > 75%, supporting their use in the early symptomatic phase. Moreover, although sensitivity declined with Cq > 25, a few SARS-CoV-2 Ag RDT (4/122; 3.3%) showed very high sensitivity: 97.5% for Cq < 30 or up to 86% for the complete Cq range (Cq 17-36).
Further independent head-to-head evaluations of SARS-CoV-2 Ag RDT have been published which, at the time of writing this manuscript, were limited to the comparison of only a few tests [22-28]. The respective conclusions based on clinical specimens are largely consistent with our results, and the sensitivity ranking of different tests was often in line with that obtained with our evaluation panel. For a valid comparison between different RDT, it is essential to follow the instructions for use, including the use of the swabs provided with the specific RDT, which potentially affect the release of virus compounds into the elution buffer (see also [9]). This precondition is not always fulfilled by studies comparing different RDT.
Since most of the SARS-CoV-2 Ag RDT offered in Europe are provided without a read-out device, visual interpretation of test results is indispensable. We would like to emphasise that only a few discrepant test results between the two experienced laboratory technicians were reported. These equivocal results were ultimately interpreted as reactive, in favour of the tests under investigation. However, visual read-out and subjective interpretation of faint test lines, potentially caused by borderline concentration of the analyte, present a challenge for less experienced users, e.g. lay persons using Ag RDT as self-tests.
A limitation of this study is its spot check nature since it cannot address variations between different batches of the same product, or variations between different test locations (see also [9]).

Conclusion
By using the same panel for a large number of different SARS-CoV-2 Ag RDT, we were able to evaluate the comparative performance of the different tests under the same conditions. The evaluation panel proved to be of appropriate design for sensitivity differentiation of SARS-CoV-2 Ag RDT, distinguishing better performing from less suitable tests. The comparative evaluation needs to be continued to keep pace with the rapidly growing market of SARS-CoV-2 Ag RDT. Since the panel is now close to exhaustion, we will continue the evaluation with a new set of samples with similar features, calibrated against the current panel. Although the study was not performed with individual clinical samples, the resulting limitation may be small because the panel consists of pooled clinical specimens; we are confident that the results reflect the pre-analytical and analytical features of the RDT well.