Analysis of consumer food purchase data used for outbreak investigations, a review

Background: Investigations of food-borne outbreaks are frequently unsuccessful and new investigation methods should be welcomed. Aim: To describe the use of consumer purchase datasets in outbreak investigations and consider methodological and practical difficulties. Methods: We reviewed published papers describing the use of consumer purchase datasets, where electronic data on the foods that case-patients had purchased before onset of symptoms were obtained and analysed as part of outbreak investigations. Results: For the period 2006–17, scientific articles were found describing 20 outbreak investigations. Most outbreaks involved Salmonella or Shiga toxin-producing Escherichia coli, and the investigations were performed in eight different countries. The consumer purchase datasets were most frequently used to generate hypotheses about the outbreak vehicle where case interviews had not been fruitful. Secondly, they were used to aid trace-back investigation where a vehicle was already suspected. A number of methodological as well as (in some countries) legal and practical impediments exist. Conclusions: Several of the outbreaks were unlikely to have been solved without the use of consumer purchase datasets. The method is potentially powerful and, with future improved access to big data purchase information, may become a widely applicable tool for outbreak investigations, enabling investigators to quickly find hypotheses and at the same time estimate the associated odds ratios or relative risks. We suggest using the term ‘consumer purchase data’ to refer to the approach in the future.


Introduction
Food-borne illnesses are a considerable cause of mortality, in particular among children, in the developing world and an important cause of morbidity in the developed world. Work from the World Health Organization (WHO) Food-borne Disease Burden Epidemiology Reference Group has estimated that 600 million foodborne illnesses occurred worldwide in the year 2010, leading to 420,000 deaths. In the WHO European Region, an estimated annual 23 million illnesses occur [1]. In the United States (US), it has been estimated that food-borne illness that can be specifically attributed to the major pathogens affects more than 48 million citizens annually [2] and amounts to an economic burden of several billion US dollars [3].
In the European Union (EU), food-borne disease outbreaks also occur frequently. In 2015, 4,362 outbreaks were of such relevance that they were reported to the European Centre for Disease Prevention and Control (ECDC) and the European Food Safety Authority (EFSA) [4], and the control of outbreaks lies at the heart of the effort to reduce food-borne illnesses. Investigations of outbreaks help stop disease transmission, contribute to our understanding of the underlying outbreak drivers, and help to improve food safety. However, investigating food-borne outbreaks is often not a straightforward task. For dispersed outbreaks where microbiological proof often cannot readily be obtained, two steps are difficult but critical to the success of the investigation: finding hypotheses, generally done via extensive interviews with outbreak cases, and proving or disproving those hypotheses, generally done by use of analytical epidemiology. Outbreaks caused by agents with a long incubation time or by several different products, products with long shelf lives, low brand recognition, or representing subsets of foods that are very commonly consumed are especially hard to resolve through patient interviews. Thus, alternative methods for their investigation should be considered. One such method utilises individualised consumer purchase data to resolve outbreaks, taking advantage of the fact that many retailers collect and store this information in searchable databases. The method has been used irregularly over the past decade with heterogeneous reporting and methodology, and more wide-scale, systematic implementation has not ensued.
In this article, we review the literature on consumer purchase data use in outbreak investigations. We classify the categories of usage found in the literature, address methodological difficulties and outline some future perspectives.

Methods
We searched for and included published studies in English involving food-borne outbreaks where consumer purchase data (e.g. loyalty card or credit/debit card data) were applied in outbreak investigations. The search was conducted in August 2016 using the PubMed database and Google Scholar and was repeated in October 2017 with the additional inclusion of the Scopus and Web of Science databases. The latter search combined the search terms ('disease outbreak*' AND 'food*') OR 'food contamination*' OR 'foodborne disease' with the search terms 'card*' OR 'receipt*' OR 'loyalty*' OR 'till*' OR 'membership*'. MeSH terms were used in Medline. In addition, further studies cited within the papers or already known to the author group or collaborators were also included. The search included papers published from January 2006 to 20 October 2017. Papers describing simulated outbreaks were excluded, as were papers where consumer purchase data were not applied in relation to foodborne outbreak investigations. The search was done by one author. Papers selected for narrative synthesis were assessed by two authors and discussed within the author group to reach consensus on methodology. A classification of use was made within the categories: hypothesis generation, trace back, corroboration of hypothesis, analytical usage.
In this paper, we use the term 'consumer purchase data' to cover the different sources of information on food purchases, e.g. credit card or loyalty card records, as well as the entire process of using such data as a method for outbreak investigations.

Hypothesis generation
In all reports of dispersed outbreaks, the investigators followed the standard approach of aiming to generate a hypothesis as to the vehicle of the outbreak, using hypothesis-generating interviews with standardised questionnaires. Consumer purchase data were used in situations where the initial hypothesis-generating activities did not lead to a hypothesis or where the product category suggested was unspecific. Case-patients gave permission to the outbreak investigation teams to access sources of information that could be used to perform a search. These included loyalty or 'shopper' card numbers, credit or debit card numbers or (online) bank statements detailing supermarket purchases. This information was then taken to the retailers to search for computerised data on all specific products bought in each transaction made before onset of symptoms. The time window of purchases was defined as 3 weeks [18,20], 6 weeks [24] or 3 months [8] before onset of symptoms or was not mentioned. Some investigators performed the search manually, others in a semi-automated manner by retrieving the data from a central supermarket computer system.
In seven studies (Table 2), the use of the method gave a narrower range of candidate products (often only one) than the range of products initially identified using hypothesis-generating interviews [5,10,12,15,18,20,24]. In one outbreak, no hypotheses were found [19]. The number of cases from whom consumer purchase information was obtained ranged from four to 43 in the reviewed studies. Following use of the method, testing of hypotheses was generally performed using standard methods, such as case-control studies or microbiological analysis of foods.
In one example, a local cluster of Salmonella Enteritidis cases detected by routine surveillance was investigated. Use of the method on a subset of cases identified three possible vehicles: tomatoes, avocado and pine nuts. Further investigation, including microbiological examination of products collected from cases' homes, identified pine nuts as the source of the outbreak [12]. In a second example, an outbreak of STEC O26 infections affecting primarily children under the age of 3 years, interviews with parents failed to produce workable hypotheses. Comparison of purchase data from seven families revealed that six of these had bought a specific brand of organic beef salami before onset, a product that none of the parents had reported during the interviews. A subsequent case-control study corroborated this product as the source of infections and the outbreak strain was later also isolated from the product [24].
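The comparison of purchases across case households described above can be illustrated with a short Python sketch. It is not taken from any of the reviewed papers: the function name, the record layout (household, purchase date, product) and the toy data are all hypothetical, and the 21-day default window simply mirrors the 3-week window reported in some of the studies.

```python
from collections import defaultdict
from datetime import date, timedelta


def common_purchases(purchases, onset_dates, window_days=21):
    """Rank products by the number of case households that bought them
    within `window_days` before that household's symptom onset.

    purchases   -- iterable of (household_id, purchase_date, product) tuples
    onset_dates -- dict mapping household_id to date of symptom onset
    """
    households_by_product = defaultdict(set)
    for household, purchase_date, product in purchases:
        onset = onset_dates[household]
        window_start = onset - timedelta(days=window_days)
        # Keep only purchases made inside the exposure window.
        if window_start <= purchase_date < onset:
            households_by_product[product].add(household)
    counts = {p: len(h) for p, h in households_by_product.items()}
    # Most widely shared products first; ties broken alphabetically.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical toy data, for illustration only.
onset_dates = {
    "A": date(2017, 3, 20),
    "B": date(2017, 3, 22),
    "C": date(2017, 3, 25),
}
purchases = [
    ("A", date(2017, 3, 10), "organic beef salami"),
    ("A", date(2017, 3, 12), "milk"),
    ("B", date(2017, 3, 15), "organic beef salami"),
    ("B", date(2017, 1, 2), "milk"),  # outside the exposure window
    ("C", date(2017, 3, 18), "organic beef salami"),
    ("C", date(2017, 3, 19), "milk"),
]

ranking = common_purchases(purchases, onset_dates)
for product, n_households in ranking:
    print(f"{product}: bought by {n_households} of {len(onset_dates)} households")
```

Counting households rather than individual transactions reflects the point that purchases are made for a household, so one family buying a product several times should not outweigh several families each buying it once.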

Analytical epidemiology
None of the studies of dispersed outbreaks used consumer purchase data for a regular analytical study, i.e. to produce a measure of association, such as an odds ratio. However, in several instances, the results obtained from use of the method were of sufficient specificity to produce convincing evidence as to the outbreak source. In an outbreak of salmonellosis in France, epidemiological investigations led to the hypothesis that salami-style pork sausage was the vehicle. Of 39 cases whose shopping data in one supermarket chain were retrieved, 22 had bought such sausages and 15 had bought exactly the same product from a single producer. Using overall sales data from the supermarket chain, this product was found to constitute only 3% of all salami sales. Based on this, a recall of the sausage was undertaken [21].
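To indicate why such a finding is convincing even without a formal analytical study, the figures above can be put into a simple binomial calculation. This is a rough sketch under a strong simplifying assumption that is ours and not the original investigators': that, in the absence of an outbreak, each of the 22 sausage-buying cases would independently have chosen the implicated product with a probability equal to its 3% share of salami sales.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """Probability of observing k or more successes in n independent
    trials, each with success probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Figures quoted in the text: 15 of 22 sausage-buying cases bought the
# same product, which made up about 3% of the chain's salami sales.
# Treating market share as the per-buyer probability is an assumption.
p_value = binomial_upper_tail(15, 22, 0.03)
print(f"P(15 or more of 22 buyers by chance) = {p_value:.2e}")
```

Under this assumption, the probability of seeing so many cases buy the same product by chance is far below any conventional significance threshold. The calculation ignores regional differences in sales and repeat purchases, so it should be read as an order-of-magnitude argument only.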
Two reports concerned analytical usage in a point-source outbreak setting. Following an S. Enteritidis outbreak found to be associated with a take-away restaurant in London, sales data were used to point to a particular chicken meal. This was done by comparing sales made by cases with sales made by other customers at the same hour the day before [11]. The second report concerned a local outbreak that formed part of the larger German O104 STEC outbreak in 2011 [25]. It occurred among employees of a company and was linked to the company canteen, where employees paid for lunch meals using their employee access cards. This meant that the employees' lunch choices were electronically registered. In this way, the strength of the association between case status and sprout-containing salad meals could be estimated in a retrospective nested case-control study within the cohort [22].
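A measure of association of the kind estimated in these two reports can be computed from a 2x2 table of purchase records. The sketch below shows a standard odds ratio with a Woolf (log-based) 95% confidence interval; the counts are invented for illustration and are not the figures from either study, and the function name is hypothetical.

```python
from math import exp, log, sqrt


def odds_ratio(a, b, c, d, z=1.96):
    """Odds ratio with a Woolf (log-based) confidence interval.

    a -- cases who purchased the item      b -- cases who did not
    c -- non-cases who purchased the item  d -- non-cases who did not
    """
    if min(a, b, c, d) == 0:
        # Haldane-Anscombe correction to avoid division by zero.
        a, b, c, d = (x + 0.5 for x in (a, b, c, d))
    estimate = (a * d) / (b * c)
    se_log = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = exp(log(estimate) - z * se_log)
    upper = exp(log(estimate) + z * se_log)
    return estimate, lower, upper


# Invented counts: 20 of 25 cases and 10 of 50 non-cases bought the meal.
estimate, lower, upper = odds_ratio(20, 5, 10, 40)
print(f"OR = {estimate:.1f} (95% CI {lower:.1f} to {upper:.1f})")
```

With electronically registered purchases, the four cell counts come directly from the transaction records rather than from recall, which is what made the canteen analysis possible.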

Trace back or trace forward
In 13 studies, trace-back and/or trace-forward investigation was performed by use of consumer purchase data, once a probable source of the infections had been identified (Table 2) [5-8,10,13-17,20,21,23]. The source of the infections in the studies ranged from vegetables, fruits and nuts (raw tomatoes, organic basil, blueberries, frozen fruit blend, pine nuts) to meat products (beef burgers, poultry, delicatessen sausages and meat, ground turkey, dried pork sausages, fermented sausage and rotisserie chicken) (Table 1). In some studies, this trace back formed part of the evidence for what constituted the source of the outbreak.
In one outbreak, hypothesis generation was guided by loyalty card-derived purchase data, which revealed a specific type of salami as a common food purchase. The purchase data therefore also facilitated locating the distributor. The resulting trace-back investigation indicated that dried pepper, used as an ingredient in the salamis, was the probable source of the outbreak. Trace forward led to the identification of further tainted products and of human cases affected by a second Salmonella serotype found in a red pepper storage facility, thereby extending the understanding of the outbreak [15]. In hepatitis A virus outbreaks in Canada and Scandinavia, frozen fruit/berries were identified as sources. The long incubation period and the fact that multiple similar product categories existed made trace back a challenge. Analysis of purchase data records allowed investigators to pinpoint the precise products via the food product identification codes, without which trace back would most likely not have been possible [6,8,17].
Finally, in one outbreak [7], consumer purchase data were used to directly target exposed individuals. In this hepatitis A virus outbreak in the US, purchase data were used to define cases (purchase/exposure being part of the case definition), to warn customers who had purchased the product by use of automated voice-message phone calls, and to target post-exposure immunisation to exposed customers. This was carried out by the affected retail chain, and not through data sharing with public health officials.

Discussion
In this review, we found that consumer purchase data have been applied successfully in several phases of outbreak investigations. In the studies reviewed, the method was used for forming or assisting in forming hypotheses for the source/vehicle of the outbreaks where prior interviews had proven insufficient. Additionally, purchase data often aided source finding, providing a product subtype and sometimes even a lot or batch number. In some outbreaks, time to product recall was reduced, in others it was unlikely that the source would have been found, had it not been for the purchase data. The low number of documented purchase events needed in many of the studies to identify a probable source is a promising finding. Conversely, 20 papers published over the last decade represents a rather low number, suggesting the existence of obstacles to widespread use. We suggest using the term 'consumer purchase data' in future to refer to the approach as we think this term better captures the different aspects of the approach that we encountered than terms using the word 'card'.
Critical steps in the investigation of food-borne outbreaks concern identification of suspect food products and providing proof of the source beyond reasonable doubt. We believe the evidence available from the papers reviewed here suggests that the use of purchase data may be a generalisable investigation method that could be very attractive for the investigation of challenging food-borne outbreaks. As some of the papers showed, searching through datasets across households with case-patients for common purchases may often be a more powerful method than the standard methods of interviewing case-patients, which are subject to incomplete recall. Interviews are less efficient in situations where, for instance, the period between interview and exposure is long [26] or the food is of a kind that is unlikely to be reported on, such as foods that are hard to remember (e.g. sprouts), food ingredients or sub-batches of common foods.
Establishing proof is generally possible using one of three strategies: microbiological evidence (finding the disease agent in the food using a specific typing method), epidemiological evidence (showing that a strong association between case status and a specific food consumption is present) or food supply evidence (showing a correlation between cases' exposure and the presence of the incriminated foods). The papers we found generally did not use the purchase data method with the purpose of establishing proof. Potentially, however, strong evidence could be established by use of the purchase data method. If large purchase datasets from retailers were to become routinely available to outbreak investigators, comparisons could be made between case and non-case consumers. Thus, odds ratios for purchase could be calculated immediately and the process of searching for candidate foods (hypothesis generation) and the subsequent step of assessing their likelihood as outbreak vehicles (analytical epidemiology) could be performed in a single step.
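The single-step approach outlined above might, in its simplest form, look like the following Python sketch, which screens every product in a dataset and ranks products by the odds ratio for purchase among cases versus non-cases. No such system is described in the reviewed papers; the function name, the basket layout and the toy data are hypothetical, and adding 0.5 to every cell is one common way of keeping the ratio finite when a cell is empty.

```python
def screen_products(case_baskets, control_baskets):
    """Rank all products by the odds ratio for purchase, cases vs non-cases.

    Each argument maps a consumer id to the set of product codes that
    consumer bought during the exposure window.
    Returns (product, buyers among cases, buyers among non-cases, OR) tuples.
    """
    n_cases, n_controls = len(case_baskets), len(control_baskets)
    products = set().union(*case_baskets.values(), *control_baskets.values())
    results = []
    for product in products:
        a = sum(product in basket for basket in case_baskets.values())
        c = sum(product in basket for basket in control_baskets.values())
        b, d = n_cases - a, n_controls - c
        # 0.5 is added to every cell so that empty cells do not cause
        # division by zero (a smoothing choice, not the only option).
        ratio = ((a + 0.5) * (d + 0.5)) / ((b + 0.5) * (c + 0.5))
        results.append((product, a, c, ratio))
    # Highest odds ratio first; ties broken by product name.
    return sorted(results, key=lambda r: (-r[3], r[0]))


# Hypothetical toy data, for illustration only.
case_baskets = {
    "c1": {"salami", "milk"},
    "c2": {"salami", "bread"},
    "c3": {"salami", "milk"},
}
control_baskets = {
    "n1": {"milk"},
    "n2": {"bread", "milk"},
    "n3": {"milk"},
    "n4": {"salami", "bread"},
}

ranked = screen_products(case_baskets, control_baskets)
for product, a, c, ratio in ranked:
    print(f"{product}: {a} cases, {c} non-cases, OR = {ratio:.2f}")
```

Because many products are tested at once, some will show high odds ratios by chance alone, so a real implementation would need to account for multiple comparisons and for how the non-case panel is selected.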
In addition, the method may be a powerful tool for product identification, trace-back/trace-forward investigation and assessing likelihood of a food being an outbreak vehicle through comparisons of distribution and intensity of sales. A purchase data analysis could provide codes identifying the foods uniquely, such as European/International Article Numbering (EAN) or Global Trade Item Number (GTIN). This may potentially lead to efficient and fast comparative analyses using food databases. The latter is important, because trace-back investigations for larger outbreaks may reach levels of complexity where they become impossible to perform with traditional methods in addition to being lengthy and labour-intensive.
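Because such product codes carry a check digit, records from different retailers can be screened for transcription errors before they are matched against food databases. The sketch below applies the standard GS1 check-digit rule to GTIN-8, GTIN-12, GTIN-13 and GTIN-14 codes. Validating codes in this way is our illustration and not a step described in the reviewed papers, and the function name is hypothetical.

```python
def gtin_is_valid(code: str) -> bool:
    """Return True if `code` is a GTIN-8/12/13/14 with a correct check digit."""
    if not (code.isascii() and code.isdigit()) or len(code) not in (8, 12, 13, 14):
        return False
    digits = [int(ch) for ch in code]
    body, check = digits[:-1], digits[-1]
    # Weights alternate 3, 1, 3, ... starting from the digit immediately
    # to the left of the check digit and moving leftwards.
    total = sum(d * (3 if i % 2 == 0 else 1) for i, d in enumerate(reversed(body)))
    return (10 - total % 10) % 10 == check


print(gtin_is_valid("4006381333931"))  # a correctly formed EAN-13
print(gtin_is_valid("4006381333932"))  # same code with a wrong check digit
```

A valid check digit only shows that the code is well formed; it does not confirm that the code refers to the product the investigator has in mind.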
Such a framework would be strengthened by the increased penetration of card or mobile phone-based payments, expected to occur in the coming years. Combined with the foreseen increased application of whole-genome sequencing for routine surveillance of food-borne infections, it might also be valuable for the investigation of small or protracted outbreaks from continuous sources where cases are currently regarded as sporadic. Likewise, it may also be valuable for source attribution purposes, i.e. to describe the relative distribution of the sources which give rise to sporadic cases. Finally, as seen in one outbreak [7], it may be used to find and warn customers who have bought a product found to be contaminated and may thereby also help stop further cases [27].
Importantly, however, a number of requirements of a structural nature would need to be resolved before widespread use of the method could take place. These requirements include legal frameworks for ensuring consumer protection and patients' privacy and the need to establish and maintain agreements between public health institutions and retailers securing data access. Data protection regulations and other obstacles to data access differ between countries and this may be the reason why application of the method was geographically skewed. In addition, a number of more general methodological obstacles exist. First, purchase does not equal consumption, and cases may often be part of families or households so that food purchases by several persons may need to be collected. Secondly, capturing foods consumed in restaurants or bought from smaller retailers, including convenience food, remains a challenge. Thirdly, purchases made without the use of loyalty or payment cards will go unnoticed with current coverage and payment systems. Finally, not all retailers may wish to share data, affecting the coverage of the purchasing data. However, even if only imperfect data can be retrieved, the method may still produce results. An analogy can be drawn with standard disease surveillance, which often also captures only a fraction of all cases, but nonetheless is useful for finding and solving outbreaks. Hence, incompleteness in exposure assessment should not preclude efficient use of the method.
Overall, the papers we found and included contained little detail on how purchase data analysis was applied. The handling of data was most often not described in detail. With few exceptions [18,21], the total number of receipts retrieved, the period and the fraction of total purchases these receipts covered were not accounted for. Also, restrictions or obstacles of a legal, cultural or habitual nature were generally not mentioned and we could therefore not extract data on such matters. The papers did in general mention good working relationships between public health authorities and food retailers. Efforts to protect citizen privacy were not described in detail. Secure systems to handle potentially sensitive purchase data, obtain consent and share data are prerequisites for a wider implementation of consumer purchase datasets, and descriptions of such systems in future studies would be beneficial.
This review has several limitations. A broader literature search including more search terms, languages other than English or including unpublished outbreak reports might have revealed more studies. We also limited our search to after the year 2005, but we note that studies taking advantage of shopping receipts in paper form also exist from before this time [28]. The papers generally report successful use of consumer purchase data; however, this could be partly due to publication bias, which is known to affect reporting of food-borne outbreaks [29]. We found one example where consumer purchase data were used for investigation of a large outbreak without finding the source [19], but it is possible that more unresolved and unpublished outbreaks using the consumer data method exist.
In conclusion, the reviewed papers describe a powerful outbreak investigation method. It holds promise of developing into a routinely applied tool, provided that more automated procedures that reduce labour for retailers and epidemiologists, and ways of making data more available, can be found. We envision a near future where food purchase information in some countries can be automatically collected from cases of food-borne infections and compared with that of a large panel of non-cases. Such a system would significantly improve source-identification and risk-assessment efforts, facilitate efficient trace back enabling timely interventions and reduce illness caused by foodborne pathogens.