Estimating the generation interval for coronavirus disease (COVID-19) based on symptom onset data, March 2020

Background: Estimating key infectious disease parameters from the coronavirus disease (COVID-19) outbreak is essential for modelling studies and guiding intervention strategies.

Aim: We estimate the generation interval, serial interval, proportion of pre-symptomatic transmission and effective reproduction number of COVID-19. We illustrate that reproduction numbers calculated based on serial interval estimates can be biased.

Methods: We used outbreak data from clusters in Singapore and Tianjin, China to estimate the generation interval from symptom onset data while acknowledging uncertainty about the incubation period distribution and the underlying transmission network. From those estimates, we obtained the serial interval, proportions of pre-symptomatic transmission and reproduction numbers.

Results: The mean generation interval was 5.20 days (95% credible interval (CrI): 3.78–6.78) for Singapore and 3.95 days (95% CrI: 3.01–4.91) for Tianjin. The proportion of pre-symptomatic transmission was 48% (95% CrI: 32–67) for Singapore and 62% (95% CrI: 50–76) for Tianjin. Reproduction number estimates based on the generation interval distribution were slightly higher than those based on the serial interval distribution. Sensitivity analyses showed that estimating these quantities from outbreak data requires detailed contact tracing information.

Conclusion: High estimates of the proportion of pre-symptomatic transmission imply that case finding and contact tracing need to be supplemented by physical distancing measures in order to control the COVID-19 outbreak. Notably, quarantine and other containment measures were already in place at the time of data collection, which may inflate the proportion of infections from pre-symptomatic individuals.


Introduction
The 2019 coronavirus disease (COVID-19) outbreak that started in Wuhan, China in December 2019 has now been declared a pandemic. As at 22 April 2020, 2,573,143 cases of COVID-19 have been confirmed in 185 countries and territories around the world [1]. In order to plan intervention strategies aimed at bringing disease outbreaks such as the COVID-19 outbreak under control as well as to monitor disease outbreaks, public health officials depend on insights about key disease transmission parameters that are typically obtained from mathematical or statistical modelling. Examples of key parameters include the reproduction number (R) (average number of infections caused by an infectious individual), and distributions of the generation interval (time between infection events in an infector-infectee pair), serial interval (time between symptom onsets in an infector-infectee pair) and incubation period (time between moment of infection and symptom onset) [2]. Estimates of the reproduction number together with the generation interval distribution can provide insight into the speed with which a disease will spread. On the other hand, estimates of the incubation period distribution can help guide determining appropriate quarantine periods.
When the incubation period does not change over the course of the epidemic, the serial and generation interval distributions are expected to have equal means but different variances [8]. It has recently been shown that ignoring the difference between the serial and generation interval can lead to biased estimates of the reproduction number [8]. More specifically, when the serial interval distribution has a larger variance than the generation interval distribution, using the serial interval as a proxy for the generation interval will lead to an underestimation of the effective reproduction number, R. When R is underestimated, this may lead to prevention policies that are insufficient to stop disease spread [8].
The most well-known method to estimate the serial interval distribution from line list data is the likelihood-based estimation method proposed by Wallinga and Teunis [9]. In 2012, Hens et al. [10] proposed using the expectation-maximisation (EM) algorithm to estimate the generation interval distribution from incomplete line list data, building on the method of Wallinga and Teunis [9] and allowing for auxiliary information to be used in assigning potential infector-infectee pairs. Te Beest et al. [11] used a Markov chain Monte Carlo (MCMC) approach as an alternative to the EM algorithm, to facilitate taking uncertainty related to the dates of symptom onset into account. In this paper, we use an MCMC approach to estimate, next to the serial interval distribution, the generation interval distribution upon specification of the incubation period distribution. We compare the impact of differences among previous estimates of the incubation period distribution for COVID-19.

Data sources
The data used in this paper are symptom onset dates and cluster information for confirmed cases in Singapore (21 January to 26 February 2020) and Tianjin, China (14 January to 27 February 2020).
As at 26 February, 91 confirmed COVID-19 cases had been reported in Singapore. Detailed information on age, sex, known travel history, time of symptom onset and known contacts was available for 54 of these cases from the Ministry of Health (https://www.moh.gov.sg/news-highlights/, last accessed 26 February). For cases with no infector information available, it was assumed that they could have been infected by any other case within the same cluster. There were four clusters in these data, i.e. Grace Assembly of God church, Grand Hyatt business meeting, Seletar Aerospace Heights construction site and Yong Thai Hang shop. Cases known to be Chinese/Wuhan nationals or known to have been in close contact with a Chinese/Wuhan national were labelled as index cases. All other cases were assumed to have been infected locally.
As at 27 February, 135 confirmed cases had been reported by the Tianjin Municipal Health Commission. Data on these cases were available in official daily reports (http://www.tjbd.gov.cn/zjbd/gsgg/, last accessed 27 February) and included age, sex, relationship to other known cases, and travel history to risk areas in and outside Hubei Province, China. In these data, 114 cases can be traced to one of 16 clusters. The largest cluster consisting of 45 cases could be traced to a shopping mall in Baodi district of Tianjin. Through contact investigations, potential transmission links were identified for cases who had close contacts. Travel history information was used to identify some individuals as imported cases. For cases with no infector information available, it was assumed that they could have been infected by any other case within the same cluster.

Model
Assuming the incubation period is independent of the infection time, the serial interval $Z_i$ of individual $i$ can be rewritten as a convolution of the generation interval $X_i$ for individual $i$ and the difference $Y_i$ between the incubation period $\delta_i$ of individual $i$ and the incubation period $\delta_{v(i)}$ of its infector $v(i)$ [8], i.e.

$$Z_i = X_i + (\delta_i - \delta_{v(i)}) = X_i + Y_i. \qquad (1)$$

The random variables $X_i$ and $\delta_i$ are positive and are both assumed to be independent and identically distributed, i.e. $X_i \sim f(x; \Theta_1)$ and $\delta_i \sim k(\delta; \Theta_2)$, so that $Y_i \sim g(y_i; \Theta_2)$. Formula (1) implies that the generation interval and serial interval distributions have the same mean, and that the serial interval has a larger variance and can be negative.
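The statement about means and variances can be checked with a short derivation. This is a sketch that writes the serial interval as the generation interval plus the difference between the infectee's and the infector's incubation periods, and that additionally assumes the generation interval is independent of both incubation periods and that the two incubation periods are independent of each other with a common distribution:

```latex
\begin{align}
E[Z_i] &= E[X_i] + E[\delta_i] - E[\delta_{v(i)}] = E[X_i], \\
\operatorname{Var}(Z_i) &= \operatorname{Var}(X_i) + \operatorname{Var}(\delta_i) + \operatorname{Var}(\delta_{v(i)})
  = \operatorname{Var}(X_i) + 2\operatorname{Var}(\delta_i).
\end{align}
```

Under these assumptions, an incubation period SD of 2.8 days (the baseline value used later in this paper) would make the serial interval variance exceed the generation interval variance by 2 × 2.8² = 15.68 days². Because the difference of two positive incubation periods can be negative, the serial interval can be negative even though the generation interval cannot.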
The observed serial interval, $z_i$, can be expressed in terms of the latent variables as $z_i = x_i + y_i$, which implies that $z_i \sim h(z_i; \Theta_1, \Theta_2)$. The density function $h(\cdot)$ is given by the convolution formula (Mood et al. [12]),

$$h(z; \Theta_1, \Theta_2) = \int_{-\infty}^{\infty} f(z - y; \Theta_1)\, g(y; \Theta_2)\, dy.$$

In general, $h(z; \Theta_1, \Theta_2)$ and $g(y; \Theta_2)$ have no closed form for arbitrary choices of $f(x; \Theta_1)$ and $k(\delta; \Theta_2)$. Monte Carlo methods [13] can be used to estimate $h(z; \Theta_1, \Theta_2)$ as follows,

$$\hat{h}(z; \Theta_1, \Theta_2) = \frac{1}{J} \sum_{j=1}^{J} f(z - y_j; \Theta_1),$$

where $J$ is the number of Monte Carlo samples (here, 300) and $y_j$ is the $j$th Monte Carlo sample drawn from $g(y; \Theta_2)$. When all infector-infectee pairs are observed, the likelihood function is given by

$$L(\Theta \mid z_i, v(i)) = \prod_{i} h(z_i; \Theta_1, \Theta_2).$$

To account for uncertainty in the transmission links we resort to a Bayesian framework in which missing links are imputed [11] (see the following section, 'Parameter estimation'). The likelihood function is then given by $L(\Theta, v(i)_{\text{missing}} \mid z_i, v(i))$. In the main analyses, missing links $v(i)_{\text{missing}}$ are imputed allowing for positive serial intervals only. As a sensitivity analysis, we do not impose any constraints on whether or not serial intervals have a positive value.
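The Monte Carlo estimate of the serial interval density can be illustrated with a minimal Python sketch (the analyses in this paper were run in R; this is not the authors' code). It assumes gamma distributions parameterised by shape and rate, uses the incubation period parameters (3.45, 0.66) quoted in this paper, and plugs in hypothetical generation interval parameters (shape 9.0, rate 1.8) purely for illustration. All function names are invented for this example.

```python
import math
import random

def gamma_pdf(x, shape, rate):
    """Density of a gamma distribution with the given shape and rate; zero for x <= 0."""
    if x <= 0:
        return 0.0
    log_pdf = (shape * math.log(rate) - math.lgamma(shape)
               + (shape - 1) * math.log(x) - rate * x)
    return math.exp(log_pdf)

def draw_incubation_differences(shape2, rate2, n_draws, rng):
    """Draw y_j as the difference of two independent gamma incubation periods."""
    scale2 = 1.0 / rate2  # random.gammavariate expects a scale, not a rate
    return [rng.gammavariate(shape2, scale2) - rng.gammavariate(shape2, scale2)
            for _ in range(n_draws)]

def serial_interval_density(z, shape1, rate1, y_draws):
    """Monte Carlo estimate of h(z): the average of f(z - y_j) over the draws y_j."""
    return sum(gamma_pdf(z - y, shape1, rate1) for y in y_draws) / len(y_draws)

rng = random.Random(1)
y_draws = draw_incubation_differences(3.45, 0.66, 300, rng)  # J = 300 draws
# Hypothetical generation interval parameters, not estimates from the paper.
h_at_4 = serial_interval_density(4.0, 9.0, 1.8, y_draws)
h_at_minus_2 = serial_interval_density(-2.0, 9.0, 1.8, y_draws)
```

Because some draws of y are negative, the estimated density is positive at negative values of z, which is how the model accommodates negative serial intervals.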

Parameter estimation
We use the Bayesian method described in te Beest et al. [11] for parameter estimation. This method proceeds in two steps. The first step updates the missing links $v(i)_{\text{missing}}$ and the second step updates the parameter vector $\Theta_1$, i.e. the parameters of the generation interval distribution. We assume that both the generation interval and the incubation period are gamma distributed, i.e. $f(x; \Theta_1) \equiv \Gamma(\alpha_1, \beta_1)$ and $k(\delta; \Theta_2) \equiv \Gamma(\alpha_2, \beta_2)$. The parameter vector $\Theta_2$ is fixed to ($\alpha_2 = 3.45$; $\beta_2 = 0.66$), corresponding to an incubation period with a mean of 5.2 days and a standard deviation (SD) of 2.8 days [6]. Minimally informative uniform priors are assigned to the parameters of the generation interval distribution, i.e. $\alpha_1 \sim U(0, 30)$ and $\beta_1 \sim U(0, 20)$. For cases with multiple potential infectors, the possible links $v(i)_{\text{missing}}$ are assigned equal prior probabilities. The missing links are updated using an independence sampler, whereas $\Theta_1$ is updated using a random-walk Metropolis-Hastings algorithm with a uniform proposal distribution [13].
We evaluate the posterior distribution using 3,000,000 iterations, of which the first 500,000 are discarded as burn-in. Thinning is applied by taking every 200th iteration. The mean and variance of the generation interval distribution are monitored within the MCMC chain. Posterior point estimates are given by the medians (50th percentiles) of the converged MCMC chain. CrIs are given by the 2.5th and 97.5th percentiles of the converged MCMC chain. The serial interval distribution is obtained by simulating 1,000,000 draws from $h(z; \Theta_1, \Theta_2)$. All analyses were performed using R software version 3.6.2 (R Foundation, Vienna, Austria), while datasets and code are available on GitHub (https://github.com/cecilekremer/COVID19).
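The second step of the sampler, the random-walk Metropolis-Hastings update of the generation interval parameters, can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation. It treats every infector-infectee pair as known (the independence sampler for missing links is omitted), reuses one fixed set of incubation-difference draws, and runs a very short chain on hypothetical serial intervals. The starting values, step size and data are all assumptions made for this example. For comparison, the chain described above retains (3,000,000 − 500,000) / 200 = 12,500 iterations.

```python
import math
import random

def gamma_logpdf(x, shape, rate):
    """Log density of a gamma(shape, rate) distribution; -inf for x <= 0."""
    if x <= 0:
        return -math.inf
    return (shape * math.log(rate) - math.lgamma(shape)
            + (shape - 1) * math.log(x) - rate * x)

def log_likelihood(theta1, serial_intervals, y_draws):
    """Sum of log Monte Carlo densities log h(z_i) over the observed serial intervals."""
    shape1, rate1 = theta1
    total = 0.0
    for z in serial_intervals:
        h = sum(math.exp(gamma_logpdf(z - y, shape1, rate1)) for y in y_draws) / len(y_draws)
        if h <= 0.0:
            return -math.inf
        total += math.log(h)
    return total

def in_prior_support(theta1):
    """Uniform priors: shape in (0, 30), rate in (0, 20)."""
    shape1, rate1 = theta1
    return 0.0 < shape1 < 30.0 and 0.0 < rate1 < 20.0

def metropolis_hastings(serial_intervals, y_draws, n_iter, burn_in, thin, step, rng):
    """Random-walk Metropolis-Hastings with a symmetric uniform proposal."""
    current = (4.0, 1.0)  # hypothetical starting values
    current_ll = log_likelihood(current, serial_intervals, y_draws)
    kept = []
    for it in range(n_iter):
        proposal = (current[0] + rng.uniform(-step, step),
                    current[1] + rng.uniform(-step, step))
        if in_prior_support(proposal):
            proposal_ll = log_likelihood(proposal, serial_intervals, y_draws)
            # Flat priors and a symmetric proposal: accept on the likelihood ratio.
            if math.log(1.0 - rng.random()) < proposal_ll - current_ll:
                current, current_ll = proposal, proposal_ll
        if it >= burn_in and (it - burn_in) % thin == 0:
            shape1, rate1 = current
            kept.append((shape1 / rate1, shape1 / rate1 ** 2))  # monitored mean, variance
    return kept

rng = random.Random(2020)
# Incubation differences y_j with the fixed parameters (3.45, 0.66); 100 draws for speed.
y_draws = [rng.gammavariate(3.45, 1 / 0.66) - rng.gammavariate(3.45, 1 / 0.66)
           for _ in range(100)]
serial_intervals = [3, 5, 2, 7, 4, 6, -1, 5]  # hypothetical data, in days
samples = metropolis_hastings(serial_intervals, y_draws, n_iter=600,
                              burn_in=100, thin=5, step=0.5, rng=rng)
```

Proposals outside the prior support are rejected outright, which is equivalent to assigning them zero prior density. A chain this short is only meant to show the mechanics and would not be expected to have converged.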

Corollary epidemiological parameters
The Figure shows three possible transmission scenarios. The proportion of pre-symptomatic transmission is calculated as $p = P(X_i < \delta_{v(i)})$, i.e. pre-symptomatic transmission occurs when the generation interval is shorter than the incubation period of the infector. This proportion was obtained by simulating values from the estimated generation interval and incubation period distributions, assuming a mean incubation time of 5.2 days [6].
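The simulation of the pre-symptomatic proportion can be sketched as below. The sketch assumes that the generation interval and the infector's incubation period are drawn independently from gamma distributions with shape and rate parameters. For illustration it plugs in a generation interval with a mean of 5.20 days and an SD of 1.72 days (the Singapore point estimates reported in this paper), whereas the paper evaluates the proportion over the whole posterior. The function name is invented for this example.

```python
import random

def presymptomatic_proportion(shape1, rate1, shape2, rate2, n_draws, rng):
    """Estimate p = P(X < delta) by simulating independent gamma draws."""
    count = 0
    for _ in range(n_draws):
        generation_interval = rng.gammavariate(shape1, 1.0 / rate1)  # scale = 1 / rate
        incubation_period = rng.gammavariate(shape2, 1.0 / rate2)
        if generation_interval < incubation_period:
            count += 1
    return count / n_draws

# Shape and rate from a mean of 5.20 days and an SD of 1.72 days (moment matching).
mean_gi, sd_gi = 5.20, 1.72
shape1 = (mean_gi / sd_gi) ** 2
rate1 = mean_gi / sd_gi ** 2

rng = random.Random(0)
p_singapore = presymptomatic_proportion(shape1, rate1, 3.45, 0.66, 50_000, rng)
```

A plug-in value of this kind need not coincide with the posterior median reported in the paper, since the latter averages over uncertainty in the generation interval parameters.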
For each of the two outbreaks, i.e. Singapore and Tianjin, R is calculated as

$$R = \exp\left(r\mu - \frac{1}{2} r^2 \sigma^2\right).$$

In this, $r$ denotes the exponential growth rate estimated from the early ascending phase of the incidence curve, and $\mu$ and $\sigma^2$ are the mean and variance of either the generation interval distribution or the serial interval distribution [14]. We calculate R in order to highlight the bias that occurs when the serial interval distribution is used as a proxy for the generation interval distribution [8].
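The direction of the bias can be illustrated numerically. The sketch below assumes that R is computed as exp(rμ − r²σ²/2), which we take to be the relationship from [14] intended here. It uses the Tianjin growth rate (0.12) and generation interval estimates (mean 3.95 days, SD 1.51 days) reported in this paper. The serial interval variance is approximated as the generation interval variance plus twice the incubation period variance, which is an assumption that holds only under independence and is not a figure from the paper.

```python
import math

def reproduction_number(growth_rate, mean, variance):
    """R from the growth rate and the mean and variance of the interval distribution."""
    return math.exp(growth_rate * mean - 0.5 * growth_rate ** 2 * variance)

r_tianjin = 0.12                    # growth rate reported for Tianjin
mean_gi, sd_gi = 3.95, 1.51         # generation interval point estimates for Tianjin
sd_incubation = 2.8                 # baseline incubation period SD

var_gi = sd_gi ** 2
var_si = var_gi + 2 * sd_incubation ** 2  # assumed: independence of the components

R_generation = reproduction_number(r_tianjin, mean_gi, var_gi)
R_serial = reproduction_number(r_tianjin, mean_gi, var_si)
```

With equal means, the larger serial interval variance enters the exponent with a negative sign, so `R_serial` is smaller than `R_generation`, in line with the underestimation described in [8].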
CrIs for p and R are calculated by evaluating p and R at each iteration of the converged MCMC chain, i.e. at each mean-variance pair of the posterior generation/serial interval distribution. The 95% CrIs are given by the 2.5% and 97.5% percentiles of the resulting distributions.
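A minimal sketch of this percentile-based procedure is given below, assuming the same expression for R, exp(rμ − r²σ²/2). The list of mean-variance pairs is a synthetic placeholder standing in for a converged chain and does not reproduce any estimate from the paper. The helper names are invented for this example.

```python
import math

def percentile(values, q):
    """q-th percentile (0 to 100), interpolating linearly between order statistics."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100.0
    lower = math.floor(position)
    upper = math.ceil(position)
    weight = position - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight

def summarise(values):
    """Return the 2.5th percentile, the median and the 97.5th percentile."""
    return percentile(values, 2.5), percentile(values, 50), percentile(values, 97.5)

def reproduction_number(growth_rate, mean, variance):
    """Assumed form of R: exp(r * mean - r^2 * variance / 2)."""
    return math.exp(growth_rate * mean - 0.5 * growth_rate ** 2 * variance)

# Synthetic placeholder for the thinned chain: (mean, variance) pairs, not real output.
chain = [(3.0 + 0.02 * k, 2.0 + 0.01 * k) for k in range(101)]

growth_rate = 0.12  # growth rate reported for Tianjin
r_values = [reproduction_number(growth_rate, mean, variance) for mean, variance in chain]
lower_R, median_R, upper_R = summarise(r_values)
```

Evaluating R at every retained pair, rather than only at the posterior medians, carries the joint uncertainty in the mean and variance through to the interval for R.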

Sensitivity analyses
As sensitivity analyses, we investigate the robustness of our estimates of the generation interval distribution to the choice of different incubation period distributions. In particular, we fix $\Theta_2$ to ($\alpha_2 = 7.74$; $\beta_2 = 1.21$) and ($\alpha_2 = 4.36$; $\beta_2 = 0.91$), corresponding to an incubation period with a mean of 6.4 and a SD of 2.3 days [4], and a mean of 4.8 and a SD of 2.6 days [7], respectively.
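The link between the gamma parameters and the quoted means and SDs can be sketched as follows, assuming that the first parameter is a shape and the second a rate. This parameterisation is an assumption, but it reproduces the rounded summaries quoted for the baseline setting (3.45; 0.66) and for the first alternative (7.74; 1.21). The function names are invented for this example.

```python
import math

def gamma_mean_sd(shape, rate):
    """Mean and SD of a gamma distribution with the given shape and rate."""
    return shape / rate, math.sqrt(shape) / rate

def gamma_shape_rate(mean, sd):
    """Shape and rate of the gamma distribution with the given mean and SD."""
    return (mean / sd) ** 2, mean / sd ** 2

baseline_mean, baseline_sd = gamma_mean_sd(3.45, 0.66)  # quoted as 5.2 and 2.8 days
longer_mean, longer_sd = gamma_mean_sd(7.74, 1.21)      # quoted as 6.4 and 2.3 days
```

The inverse mapping, `gamma_shape_rate`, is convenient when an incubation period is reported only through its mean and SD and has to be converted into gamma parameters before it can be fixed in the model.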
In our main, i.e. baseline, analyses, missing serial intervals were only allowed to be positive, i.e. the symptom onset time of the infector has to occur before that of the infectee. However, given that pre-symptomatic transmission is possible, this can be deemed an unrealistic assumption. Therefore, we assess the impact of allowing for negative serial intervals on our estimates of the generation interval distribution.
To further assess the robustness of the estimated generation interval distribution, for each dataset, we fit the model to data from the largest cluster. In the Tianjin dataset, the largest cluster is the shopping mall cluster consisting of 45 cases. In the Singapore dataset, this is the Grace Assembly of God cluster consisting of 25 cases.

Results
Table 1 shows parameter estimates of the generation and serial interval distributions for each dataset, assuming an incubation period with a mean of 5.2 days and a SD of 2.8 days. The mean generation time is estimated to be 5.20 days (95% CrI: 3.78-6.78) for the Singapore data, and 3.95 days (95% CrI: 3.01-4.91) for the Tianjin data. As expected, the estimated means of the generation interval and serial interval distributions are approximately equal, but the latter has a larger variance.
Table 2 shows parameter estimates of the generation and serial interval distributions for each dataset, assuming incubation periods with a mean of 6.4 and a SD of 2.3 days, or a mean of 4.8 and a SD of 2.6 days. The parameter estimates are fairly robust to the specified incubation period distribution, with mean generation times of about 5 days for Singapore and 4 days for Tianjin.
Table 3 shows parameter estimates of the generation and serial interval distributions obtained when allowing for negative serial intervals for cases with no known infector. Compared with the baseline analyses (Table 1), estimates of the mean generation time are smaller when allowing for negative serial intervals. The mean generation time is 3.86 days for Singapore and 2.90 days for Tianjin.
Table 4 shows parameter estimates obtained when we fit the model to data from the largest cluster (n = 45). We only show results for the Tianjin dataset because for the Singapore data, there were too few cases (n = 25) and the MCMC chain did not converge. When allowing only positive serial intervals for cases with no known infector, the mean generation time is estimated to be 3.50 days. On the other hand, when allowing for negative serial intervals, it is estimated to be 2.57 days.
Table 5 shows the proportions of pre-symptomatic transmission and reproduction numbers for each dataset. The proportion of pre-symptomatic transmission is higher when allowing for negative serial intervals for cases with no known infector. The reproduction number is lower when estimated using the serial interval than when using the generation interval.

Discussion
We estimated the generation time to have a mean of 5.20 days (95% CrI: 3.78-6.78) and a SD of 1.72 days (95% CrI: 0.91-3.93) for the Singapore data, and a mean of 3.95 days (95% CrI: 3.01-4.91) with a SD of 1.51 days (95% CrI: 0.74-2.97) for the Tianjin data. These mean estimates increased only slightly when the assumed mean incubation period was increased. For the Singapore data, allowing the serial interval to be negative decreased the estimated mean generation time from 5.20 days, when restricting missing serial intervals to be positive, to 3.86 days (95% CrI: 2.22-5.60). For the Tianjin data, the baseline estimate of the mean generation time (3.95 days) is about the same as the Singapore estimate obtained when allowing serial intervals to be negative. However, there were already some negative serial intervals among the reported links in the Tianjin data, which may explain this lower estimate. The difference in these estimates could also be the result of differences in containment strategies. When allowing for negative serial intervals in the Tianjin data, the mean generation time decreased to 2.90 days (95% CrI: 1.85-4.12). The sensitivity analyses showed that the assumptions made about the incubation period have only a moderate impact on the results. On the other hand, assumptions made about the underlying transmission network (e.g. acknowledging possibly negative serial intervals) had a large impact on our results.

Figure. Possible transmission scenarios: (A) symptomatic transmission; (B) pre-symptomatic transmission
The upper figure of panel B shows pre-symptomatic transmission where the infector develops symptoms before the infectee (i.e. positive serial interval), whereas the lower figure shows pre-symptomatic transmission where the infector develops symptoms after the infectee (i.e. negative serial interval). Note that the figure does not include asymptomatic transmission, i.e. infected individuals who may not show symptoms but can transmit infection.

When allowing for negative serial intervals, the estimated proportion of pre-symptomatic transmission increased from 48% (95% CrI: 32-67) to [...] for the Singapore data, and from 62% (95% CrI: 50-76) to 77% (95% CrI: 65-87) for the Tianjin data. When the incubation period is longer, these proportions are expected to be higher, and when it is shorter, they are expected to be lower. Hence, a large proportion of transmission appears to occur before symptom onset, which is an important point to consider when planning intervention strategies. It is worth noting that the outbreak data we used were collected in the presence of intervention measures such as case isolation and quarantining of identified contacts. This means that our estimates do not necessarily reflect the natural epidemiology of COVID-19, but instead reflect what is observed in the presence of these intervention measures. Because isolation prevents symptomatic transmission, these measures are expected to reduce the proportion of symptomatic transmission, so that a higher proportion of the observed infections is likely to have occurred before symptom onset.
We also estimated R for the sole purpose of illustrating the bias that occurs when using the serial interval as a proxy for the generation interval [8]. Although the impact was limited in our analyses, estimates based on the generation interval are larger and should be preferred to inform intervention policies. Indeed, as expected, the reproduction number was underestimated when using the serial interval distribution, which is more variable than the generation interval distribution.
Tindale et al. [15] recently estimated the mean serial interval for COVID-19 to be 4.56 days (95% CI: 2.69-6.42) for Singapore and 4.22 days (95% CI: 3.43-5.01) for Tianjin. Although these estimates are different from the ones we report, they fall within the uncertainty ranges we obtained. An important advantage of our method is that we are able to infer the generation interval distribution while allowing serial intervals to be negative. Our estimates of R are smaller than the ones reported by Tindale et al. [15] because we use a different estimate of the growth rate r. Specifically, we used 0.04 for Singapore and 0.12 for Tianjin, as obtained from the initial exponential growth phase in each dataset, compared with the 0.15 used by Tindale et al. [15]. Our estimates of the serial interval are also in line with those of Du et al. [16], who estimated a mean of 3.96 days (95% CI: 3.53-4.39) and a SD of 4.75 days (95% CI: 4.46-5.07).

Table 1
Parameter estimates and credible intervals of generation and serial interval distributions of COVID-19 using reported information on infector-infectee pairs and assuming an incubation period with a mean of 5.2 and a SD of 2.8 days, Singapore, 21 January-26 February 2020; Tianjin, China, 14 January-27 February 2020

Another advantage of our method is that we can derive a proper variance estimate for the generation interval, rather than the overly large variance estimate obtained when using the serial interval as a proxy for the generation interval. Furthermore, from a biological point of view, we do not need to condition on the order of symptom onset times. However, when the data do not provide sufficient information on directionality of transmission, this lack of auxiliary information may cause problems for estimation.
Our study does have some limitations. First, we rely on previous estimates for the incubation period. However, our sensitivity analyses showed that changing the incubation period distribution does not have a big impact on our estimates of the generation interval distribution. Second, we do not account for incomplete reporting or for possible changes in reporting over the course of the epidemic. Incomplete reporting means that cases are missing, which leads to incomplete transmission networks. As the underlying transmission network has a large impact on our estimates, incomplete reporting may bias our estimates. Third, we do not acknowledge changes in contact patterns and thus behavioural change, which could shape realised generation interval distributions as well as serial interval distributions (data not shown). Fourth, we do not account for contraction of the generation interval because of depletion of susceptibles. Future work should take these shortcomings into account.
At the beginning of the pandemic, infection control for the COVID-19 epidemic relied on case-based measures such as finding cases and tracing contacts. A variable that determines how effective these case-based measures are is the proportion of pre-symptomatic transmission. Our estimates of this proportion are high, ranging from 48% to 77%. This implies that the effectiveness of case finding and contact tracing in preventing COVID-19 infections will be considerably smaller compared with the effectiveness in preventing severe acute respiratory syndrome coronavirus (SARS-CoV) or Middle East respiratory syndrome coronavirus (MERS-CoV) infections, where pre-symptomatic transmission did not play an important role (see e.g. [17]). As has been shown by other studies, e.g. Hellewell et al. [18], it is unlikely that these measures alone will suffice to control the COVID-19 epidemic. Additional measures, such as physical distancing, are required and are already implemented in most countries.