A system for automated outbreak detection of communicable diseases in Germany

We describe the design and implementation of a novel automated outbreak detection system in Germany that monitors routinely collected surveillance data for communicable diseases. Detecting unusually high case counts as early as possible is crucial, as an accumulation may indicate an ongoing outbreak. Detection in our system is based on state-of-the-art statistical procedures that perform the necessary data-mining task. In addition, we have developed effective methods to improve the presentation of the results of such algorithms to epidemiologists and other system users. The objective was to effectively integrate automatic outbreak detection into the epidemiological workflow of a public health institution. Since 2013, the system has been in routine use at the German Robert Koch Institute.


Introduction
In recent years, more and more data have been collected for the routine surveillance of infectious diseases. For instance, in Germany, the Robert Koch Institute (RKI) implemented a national electronic surveillance system (SurvNet@RKI) in 2001 in response to the newly enacted Protection against Infection Act, which requires regular collection of data on a number of notifiable diseases [1]. Cases are first reported by laboratories or physicians to local health authorities, which may perform further investigations, and are then transmitted to the RKI via the federal state health authorities. Collected information about cases includes sex, age, and subtype of the pathogen.
In addition to this increase in data collection, a multitude of outbreak detection algorithms for routinely collected public health data have been published [2]. Nonetheless, the added value of applying statistical methods for aberration detection at public health institutions is still subject to discussion because of several challenges, one of which is automating the data analysis and identifying relevant signals without producing a plethora of them. For instance, in October 2015, the SurvNet@RKI database contained ca 6.0 million case notifications in 88 different reporting categories such as Salmonella or norovirus, while outbreaks often only become apparent when inspecting certain subsets of the data, e.g. a specific geographical area or even a specific age group [3]. The problem is therefore to promptly identify these relevant subsets in the haystack of data. One statistical approach is to regularly analyse the data as multiple univariate time series in order to detect unexpected aberrations in specific subsets.
Nowadays, semi-automatic monitoring systems are in operation in many public health institutions (for examples in Europe, see [4]). However, because such systems produce too many signals or present them in a way misaligned with users' needs, their output often has little impact on the practical work of these institutions. First attempts to focus more on the user perspective of monitoring systems are presented in Cakici et al. [4] and Kling et al. [5]. Our goal was to develop and establish an automatic information system that supports epidemiologists at the RKI in the timely detection of potential outbreaks of communicable diseases.
In this article, we present the implementation of a novel automated monitoring system at the RKI in Germany.
The new system is now in routine use at the RKI for many reporting categories. Here, we describe the architecture of the system and our design decisions as well as first results and planned improvements. In sharing our experiences we aim to provide valuable information to others working on similar surveillance systems.

Defining features of the system
We wanted to obtain results of a consistent quality as well as a standard procedure for the routine surveillance workflow in our organisation. This objective led to specific requirements for the system that were largely in line with the checklist for computer-supported outbreak detection systems formulated by Hulth et al. [6]; that article contains recommendations such as user-friendliness and tight integration with the database. The development of the system and the refinement of the requirements were conducted iteratively. Following a rapid prototyping philosophy, we initially focused on building a first prototype for one reporting category, namely Salmonella with its many serotypes.
Once the prototypes of the components had produced first results, we started discussing the system's output for Salmonella with two users, the epidemiologists in charge of this reporting category. The experiences from the first prototype led to the design of a weekly automated report sent by email to the two epidemiologists. Once the system produced satisfactory results for this reporting category, we progressively scaled it up to 48 reporting categories, which account for roughly 80% of all received cases. Our goal has always been to create a general system for a variety of diseases instead of highly disease-specific solutions. In addition to the one-on-one discussions with the system users, we received further feedback and feature requests as the system grew.

System design
The system consists of two components: an automatic component routinely monitoring the data and a manual component which enriches data queries with ad hoc aberration detection (Figure 1). The first component automatically produces surveillance reports according to pre-defined settings. The second component allows the user to make customised queries for any time series they wish.

Automated analytical process
As shown in Figure 1, the automatic component consists of three subsystems: an analytical process, a signal database and a signal interface. The analytical process analyses the data with aberration detection algorithms and, in case of an unusually high number of cases, produces signals, which are stored in the signal database and communicated to the user through the signal interface.
The analytical process monitors the SurvNet@RKI case counts of the current and the six previous weeks on a daily basis for all reporting categories selected for aberration detection. Since outbreaks can occur in specific subsets of the population, e.g. at a specific location and in a specific age group, we monitor in parallel numerous time series corresponding to the respective subsets of the population in order to detect signals that would be invisible when analysing the whole population. In particular, we stratify the time series by pathogen subtype (e.g. a Salmonella serotype such as S. Infantis) or symptom (e.g. pneumonia), location (federal state, county), age group, sex and place of exposure. This stratification yields a set of univariate time series for each reporting category, aggregated per week or month. The number of diagnostic tests performed is not a variable collected in the German mandatory reporting system. Therefore, the analysis of the case numbers is sensitive to variations due to, for example, changes in laboratory procedures or in healthcare-seeking behaviour, e.g. during an outbreak with much media attention.
[Figure 1 caption: The user can receive output from the automatic component of the system, which consists of reports generated with predefined settings; excerpts of such output are shown in Table 1 and Table 2. The user can also actively make ad hoc queries to the manual component of the system, the output of which is illustrated in Table 3.]
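As an illustration of this stratification, the following Python sketch builds one weekly count series per subset of the stratification attributes. The field names and records are hypothetical; the actual SurvNet@RKI data model differs.

```python
from collections import Counter
from itertools import combinations

# Hypothetical, simplified case records; the real SurvNet@RKI schema differs.
cases = [
    {"week": "2013-W41", "serotype": "S. Infantis", "state": "Bavaria", "age_group": "05-14", "sex": "female"},
    {"week": "2013-W41", "serotype": "S. Infantis", "state": "Bavaria", "age_group": "05-14", "sex": "male"},
    {"week": "2013-W41", "serotype": "S. Manhattan", "state": "Berlin", "age_group": "15-29", "sex": "female"},
]

STRATA = ["serotype", "state", "age_group", "sex"]

def stratified_counts(cases):
    """Count cases per week for every combination of stratification
    attributes, yielding one univariate time series per subset."""
    counts = Counter()
    for case in cases:
        # every subset of the stratification attributes, including the
        # empty set (= the unstratified national series)
        for r in range(len(STRATA) + 1):
            for attrs in combinations(STRATA, r):
                key = (case["week"], tuple((a, case[a]) for a in attrs))
                counts[key] += 1
    return counts

counts = stratified_counts(cases)
```

Each key of `counts` identifies one monitored time series, e.g. the national series, the series per serotype, or the series per serotype, state, age group and sex.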
The system applies the implementation of the algorithm of Noufaily et al. [7] as described in Salmon et al. [8] to each time series in order to obtain a threshold for each observed count. The last four years of historic data are used as reference values for the algorithm. The algorithm uses an overdispersed Poisson generalised linear model with log link. The linear predictor accounts for seasonality through a 10-level factor variable, includes a time trend and uses a re-weighting scheme to take past outliers into account. The estimates from the regression model are used to compute a threshold specific to each monitored week, defined as a quantile of the predictive distribution of the current count. A signal is generated for time t0 if the observed number of cases exceeds the threshold. As an example, Figure 2 illustrates the detection algorithm applied to a single time series of S. Montevideo in Germany in 2009 and 2010 [9]. To address reporting delay, we monitor the current week and the six weeks before, i.e. it is possible to obtain a signal for one of the six previous weeks given the current data. This could mean, for example, getting a signal in week 5 of 2015 for the number of Salmonella infections reported during week 3 of 2015.
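The detection step can be sketched in Python as follows. This is an illustrative simplification, not the production implementation (which uses the algorithm of [7] as implemented in the R surveillance package [8]): sinusoidal terms stand in for the 10-level seasonal factor, the outlier re-weighting step is omitted, and a normal approximation replaces the exact predictive quantile.

```python
import numpy as np
from statistics import NormalDist

def fit_poisson_glm(X, y, n_iter=30):
    """Fit a Poisson GLM with log link via iteratively reweighted least squares."""
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean() + 1e-9)        # start near the overall mean
    for _ in range(n_iter):
        eta = X @ beta
        mu = np.exp(eta)
        w = mu                               # Poisson working weights
        z = eta + (y - mu) / mu              # working response
        XtW = X.T * w
        beta = np.linalg.solve(XtW @ X, XtW @ z)
    return beta

def design(t, harmonics=2):
    """Intercept + linear trend + sinusoidal seasonality (a stand-in for
    the 10-level seasonal factor of the published algorithm)."""
    t = np.atleast_1d(t).astype(float)
    cols = [np.ones_like(t), t / 52.0]
    for k in range(1, harmonics + 1):
        cols.append(np.sin(2 * np.pi * k * t / 52.0))
        cols.append(np.cos(2 * np.pi * k * t / 52.0))
    return np.column_stack(cols)

def threshold(t_hist, y_hist, t0, alpha=0.01):
    """Upper detection threshold for the count expected at week t0."""
    y_hist = np.asarray(y_hist, dtype=float)
    X = design(t_hist)
    beta = fit_poisson_glm(X, y_hist)
    mu_hist = np.exp(X @ beta)
    # quasi-Poisson overdispersion estimate from Pearson residuals
    phi = max(1.0, ((y_hist - mu_hist) ** 2 / mu_hist).sum()
                   / (len(y_hist) - X.shape[1]))
    mu0 = float(np.exp(design(t0) @ beta)[0])
    z = NormalDist().inv_cdf(1.0 - alpha)
    # normal approximation of the overdispersed predictive quantile
    return mu0 + z * np.sqrt(phi * mu0)

# four years of weekly historic counts with a seasonal pattern
rng = np.random.default_rng(1)
t_hist = np.arange(208)
y_hist = rng.poisson(np.exp(1.5 + 0.4 * np.sin(2 * np.pi * t_hist / 52.0)))
u = threshold(t_hist, y_hist, 208)   # threshold for the current week
```

An observed count above `u` would generate a signal for that week.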
The automated analytical process was initially implemented solely in the statistical programming language R [10], using the surveillance package [8,11] for the detection part and other R packages [12] for the data pre- and post-processing steps, as well as for support for behaviour-driven software development. As in other systems [4,13], the automated component was built in a modular way so that it can incorporate different detection algorithms. R was chosen over other programming languages because it allowed us to use a variety of statistical detection algorithms and visualisation procedures out of the box, and because of its suitability for rapidly prototyping statistical procedures. During subsequent development we ported large parts of the data management components to Microsoft C#/.NET to harmonise the system with the existing information technology infrastructure at the RKI.

Signal database
The signal database stores the signals generated by the analytical process. A signal corresponds to statistical evidence that the case count in a given subset of the data is higher than would be expected based on historic data. A signal combines information about the data subset in which elevated case counts were detected, i.e. a filter on the data defined by a set of attributes (e.g. 'Hepatitis A; week 25 of 2013'), with information about the algorithm itself, its configuration and its output (e.g. the detection threshold).
This definition can be used directly to store the signals in the signal database and enables subsequent processing of the signals. This has direct advantages over analysis and communication as a combined step: the signals can have an age, they can be more or less important, they can be similar to each other and they can disappear over time when new data are received. In addition, signals can be communicated differently based on aspects such as user preferences.
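A signal in the sense of this definition could be modelled roughly as follows; all field and value names are illustrative, not the actual database schema.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Signal:
    """One signal: a filter on the data plus the detector's configuration
    and output. All field names are illustrative."""
    category: str        # e.g. "Hepatitis A"
    week: str            # e.g. "2013-W25"
    filter_attrs: tuple  # e.g. (("state", "Bavaria"), ("sex", "male"))
    algorithm: str       # detector used, with its configuration id
    threshold: float     # detection threshold for this week
    observed: int        # observed case count
    generated_on: date   # lets a signal "age" and be re-evaluated later

sig = Signal("Hepatitis A", "2013-W25", (("state", "Bavaria"),),
             "noufaily-default", 7.2, 12, date(2013, 6, 21))
```

Storing `generated_on` alongside the detector output is what allows signals to age, be ranked, and disappear when new data arrive.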

Signal interface and communication
Signals are communicated to the user through predefined report templates for each reporting category. The reports display the relevant signals found for a given category within a given time period. In addition to these main reports, several other reports display new signals found recently (for instance a signal at week 46 for cases reported during week 45 that did not give a signal at week 45, e.g. because of transmission delays), line lists and a spatial visualisation of the cases. The main reports are archived as Microsoft Excel files once a day and are sent by email to the epidemiologists in charge of specific reporting categories once a week. Such a push/pull principle of communication was inspired by other monitoring systems such as the one described by Reis et al. [13].
[Figure 2 caption: The algorithm is described in [7] and its open-source implementation in [8]. Blue bars indicate the observed number of cases; the red arrow indicates a signal.]
The signal interface uses Microsoft SQL Server Reporting Services [14], mainly because it is already in use at the RKI. It allows quick development of reports that can be accessed from the intranet through a web browser and supports exporting the reports as Microsoft Excel files. Furthermore, in order to support the decision on whether a signal is relevant, the user can click on any case count in a report to see the associated list of cases from the SurvNet@RKI database (line list).

Signal abstraction
The procedure makes use of the fact that each signal is associated with a filter on a set of attributes, e.g. geographical location, temporal location, sex and age group. Given a set of signals available for reporting, we first determine similar signals by partitioning the original set of signals into signal groups. All signals within a specific group have equal values for a number of filter attributes. For example, we could group all signals by week so that each signal group consists of signals with the same reporting week, e.g. 2013 week 42. In the system at the RKI, we group on all attributes except sex, age group and reporting location of the signal. Thus the signals within a group will not necessarily have the same values for sex, age group and location.
In a second step, we filter out some signals in each of these groups, while retaining others, to avoid presenting information which is not considered relevant for users. This is done via so-called filter relations, which allow us to rank and compare signals according to a predefined metric. We use three different relations: 'more specific than', 'more general than' and 'more specific on the location and more general on age and sex'. The user can select between no reduction, one of the three relations or a combination of the first two relations. For example, the most general signal could be one for Salmonella in week 22 in Bavaria as a whole, whereas the most specific signal could be one for Salmonella in week 22 in Munich for male cases. It is therefore possible to focus the analysis of the signals on specific aspects, e.g. locating the centre of a possible outbreak by displaying only the most specific signals in terms of their filter attributes.
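The grouping step and the 'more specific than' relation could be sketched like this. The structures are hypothetical, and the flat attribute comparison below does not capture geographic containment (e.g. a county within a federal state), which the real relations also handle.

```python
from collections import defaultdict

# attributes allowed to vary within a signal group (as in the RKI setup)
GROUP_EXEMPT = {"sex", "age_group", "location"}

def group_signals(signals):
    """Partition signals so that, within a group, all filter attributes
    except sex, age group and location are equal."""
    groups = defaultdict(list)
    for sig in signals:
        key = tuple(sorted((k, v) for k, v in sig["filter"].items()
                           if k not in GROUP_EXEMPT))
        groups[key].append(sig)
    return dict(groups)

def more_specific_than(a, b):
    """a is at least as specific as b: a's filter contains all of b's
    attribute-value pairs (and possibly more)."""
    return all(a["filter"].get(k) == v for k, v in b["filter"].items())

def most_specific(signals):
    """Keep only signals not strictly generalised by another in the set."""
    def strictly(t, s):
        return t["filter"] != s["filter"] and more_specific_than(t, s)
    return [s for s in signals if not any(strictly(t, s) for t in signals)]
```

Applying `most_specific` within each group reduces a cluster of overlapping signals to the ones that localise the possible outbreak most precisely.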

Manual analytical component
In addition to the automatic tool for outbreak detection, we also provide a detection tool that can be applied to almost any user-defined subset of the data, allowing users to screen very specific time series on demand. This capability was a wish expressed during meetings with future users before the design of the system. The component monitors specific subsets of the data, for example case counts of hepatitis A in Berlin within the last six weeks, by comparing current counts with past data, using a method similar to the algorithm of Stroup et al. [15].
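In the spirit of the algorithm of Stroup et al. [15], such a comparison could be sketched as follows; the reference-value scheme and the parameters here are illustrative, not those of the actual component.

```python
from statistics import mean, stdev

def stroup_like_signal(counts, year, week, n_years=5, z=2.0):
    """Compare the count of (year, week) with reference values taken from
    the same week and the two adjacent weeks of the n_years previous
    years; flag it when it exceeds mean + z * standard deviation.
    counts maps (year, week) -> case count; weeks run 1..52 here for
    simplicity (53-week ISO years are ignored in this sketch)."""
    refs = []
    for y in range(year - n_years, year):
        for w in (week - 1, week, week + 1):
            w = (w - 1) % 52 + 1          # wrap around year boundaries
            if (y, w) in counts:
                refs.append(counts[(y, w)])
    limit = mean(refs) + z * stdev(refs)
    return counts[(year, week)] > limit, limit
```

Because the reference values come from comparable calendar periods, the comparison is robust to seasonality without fitting a regression model.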

Report interface
As at October 2015, 62 users at the RKI and federal state health authorities received weekly reports from the automated component and interacted with the reports. Table 1 and Table 2 correspond to an excerpt of the Excel-based report for cases of Salmonella infection reported in weeks 41 to 46 in 2013. The report contains two data tables with a similar structure. For each week t, we report the number of cases y_t, the estimated expected case count μ_t, the threshold U_t and the number of cases o_t that were manually marked as being part of an outbreak in the SurvNet@RKI database. Cases are sometimes identified as a cluster by local health authorities, e.g. a cluster of cases of norovirus infection after a shared meal. Coloured cells in the tables indicate signals for the respective week. Signals that were detected seven or more days before the current week are marked yellow, newer signals are marked red. Table 1 corresponds to the reported number of cases per serotype for the six weeks before the current week, in this example with a signal for S. Infantis in week 41. Table 2 displays the results of a stratified analysis as described in the previous section. In this example, we see a cluster of female cases of S. Manhattan infection in week 41. Some of these signals prompt further checks by epidemiologists, helped by a direct link between the signal and the corresponding cases (line list). The number of signals in a report depends on the interplay of the number of time series formed by the considered subgroups of sex, age and geographical location, the algorithm settings for the disease, and whether signal reduction is performed. From January to October 2015, the median number of signals over all filters in the weekly Salmonella report was 62.
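The colour coding of signal cells could be implemented along these lines; this is a simplified sketch that compares against the report date rather than the start of the current week.

```python
from datetime import date

def signal_colour(detected_on: date, report_date: date) -> str:
    """Yellow for signals detected seven or more days before the report,
    red for newer ones (simplified version of the rule above)."""
    return "yellow" if (report_date - detected_on).days >= 7 else "red"
```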

Experiences from operation
Since 2013, the monitoring system has been widely adopted at the RKI. Although it has not been formally evaluated yet, we observe positive user acceptance, supported for example by an increasing number of users and by feedback in discussions. Furthermore, the system has contributed to several outbreak investigations. For example, it detected a large local outbreak of cryptosporidiosis in August 2013 [16]. Apart from outbreak detection, the tool has raised awareness among epidemiologists, especially those monitoring trends in frequently notified infections prone to causing outbreaks: the number of cases for various aggregations of the data can now easily be visualised. Moreover, the aberration detection tool for dynamic data queries on case counts is appreciated because it is not always straightforward to visually assess whether the numbers in a time series plot are higher than usual. The manual component of our system provides a statistically informed decision for this.
We developed a system that provides results that are easy to understand and use, while being based on sound statistical methods, with disease- and user-specific adjustments. The system is the result of an interdisciplinary collaboration between computer scientists, statisticians and epidemiologists, combining user-focused system design, correct treatment of uncertainty and infectious disease knowledge to obtain a decision support tool useful for everyday practice.
Although the system already produces valuable results for routine work at the RKI, a number of improvements are possible. We are working on the problem of comparing frequently incomplete first-version data (e.g. where a pathogen subtype and a possible travel history of the case may not yet be known) with historic, more complete, last-version data (e.g. where subtype and possibly country of infection have been added); each version is automatically numbered by the system each time a change is made to a case report. Moreover, it may be possible to add specific detection algorithms for dealing with reporting delays [17,18]. Furthermore, we are currently only able to detect outbreaks when case numbers are above the threshold in at least one week, i.e. if an outbreak emerges very slowly over several weeks it might not be detected quickly. Here, cumulative sum (CUSUM)-oriented procedures could be better at picking up the signal [19] because they accumulate evidence over several time points. On the geographical level, only a fixed set of regions is monitored: Germany as a whole, federal states, counties and each county with its adjacent neighbours (which may overlap state borders). Thus we are only able to geographically detect outbreaks that are visible in one of these predefined county clusters. However, the architecture of the system would allow us to include more sophisticated space-time methods into the surveillance process, such as those used by Kulldorff, Tango et al. and Neill [20][21][22]. In addition, performing tests on many time series is a classical case of multiple testing and thus leads to false alarms. Currently, we offer epidemiologists the line list to delve deeper into the data generating the signals in order to better understand the context, so that they can easily navigate the different signals of a report.
A framework for controlling overall false alarm rates individually for each user in combination with the signal abstractions could further improve user acceptance.
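A minimal sketch of such a CUSUM-type accumulation of evidence follows; the constants k and h are illustrative tuning parameters, not values used in the system.

```python
def cusum_alarms(counts, expected, k=0.5, h=4.0):
    """Upper one-sided CUSUM: accumulate (observed - expected - k) and
    raise an alarm when the running sum exceeds h. Unlike a week-by-week
    threshold test, small persistent exceedances add up over time."""
    s, alarms = 0.0, []
    for y, mu in zip(counts, expected):
        s = max(0.0, s + (y - mu) - k)
        alarms.append(s > h)
        if s > h:
            s = 0.0   # restart accumulation after an alarm
    return alarms
```

In this sketch, a run of counts only moderately above their expectation eventually triggers an alarm, which is exactly the slowly emerging outbreak scenario described above.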
We see these accounts of successful implementation at public health institutions as an important contribution, because automatic detection systems are much needed in the current big data environments arising from routine surveillance data collection. Our aim here was to explain the RKI development strategy and the user focus of the system. A more technical article describes the algorithmic functionality of the R surveillance package [17].
The amount of data held by public health institutes will certainly continue to grow. As a consequence, automatic outbreak detection systems, such as the one presented here, will become increasingly important. At the same time, care is needed when integrating such a system into a workflow and taking further steps towards user acceptance. From an organisational point of view, a challenge is to design effective guidelines on how the generated signals are to be handled in a standardised way. This could range from considering signals only as an additional resource for surveillance to having each signal checked by an epidemiologist. Now that the system is in place, one could in the future tailor the detection even more to the needs of the users, e.g. by actively including user feedback in the statistical detection algorithms. Including user feedback could start by collecting appropriate data about the users' reactions to each signal. We think that our experience with an automatic surveillance system will motivate the development and maintenance of similar decision support tools in other European countries.