The use and reporting of airline passenger data for infectious disease modelling: a systematic review

Background A variety of airline passenger data sources are used for modelling the international spread of infectious diseases. Questions exist regarding the suitability and validity of these sources. Aim We conducted a systematic review to identify the sources of airline passenger data used for these purposes and to assess validation of the data and reproducibility of the methodology. Methods Articles matching our search criteria and describing a model of the international spread of human infectious disease, parameterised with airline passenger data, were identified. Information regarding type and source of airline passenger data used was collated and the studies’ reproducibility assessed. Results We identified 136 articles. The majority (n = 96) sourced data primarily used by the airline industry. Governmental data sources were used in 30 studies and data published by individual airports in four studies. Validation of passenger data was conducted in only seven studies. No study was found to be fully reproducible, although eight were partially reproducible. Limitations By limiting the articles to international spread, articles focussed on within-country transmission even if they used relevant data sources were excluded. Authors were not contacted to clarify their methods. Searches were limited to articles in PubMed, Web of Science and Scopus. Conclusion We recommend greater efforts to assess validity and biases of airline passenger data used for modelling studies, particularly when model outputs are to inform national and international public health policies. We also recommend improving reporting standards and more detailed studies on biases in commercial and open-access data to assess their reproducibility.


Introduction
International movement of individuals through commercial airline travel has been implicated in the transnational dissemination of many infectious diseases and is thought to be the principal mode of human pathogen transfer between continents. Examples include the global dissemination of the outbreak of severe acute respiratory syndrome in 2003, which quickly spread from Hong Kong to North America [1]. The 2009 influenza pandemic [2], which emerged in Mexico and affected more than 208 countries, followed a similar international dissemination [3]. The number of airline travellers is increasing year on year, with a total of 1,186 million international tourist arrivals globally in 2015, a 4.6% increase from 2014 and 510 million arrivals more than in 2000 [4]. In addition, tourism visits to emerging economies are now comparable to those of high-income countries, with countries such as Mexico and Thailand entering the top 15 of the most visited destinations. The global trend is expected to keep rising and reach 1.8 billion arrivals in 2030 [4]. Lower fares and greater availability make geographically distant destinations easier to reach for a greater number of people [5].
With the volume of airline passengers increasing each year [6], it is important to understand the dynamics of the airline network and its role in disease spread and control [7]. We need to be able to accurately predict international transmission through passenger flow. Mathematical models are useful tools that can estimate the risk of infectious disease importation and exportation by international airline passengers [8], especially in the early stages of an outbreak when accurate reporting may be difficult [9]. Models such as the one developed by Lopez et al. use the force of infection in the visited country to determine the risk to international visitors, assuming an arbitrary number of airline passengers [8]. However, this risk can also extend to new areas when returning passengers carry pathogens back to their country of residence, as was the case in Italy in 2007, when an autochthonous chikungunya outbreak occurred following importation [10]. Mathematical models of pathogen importation/exportation risks usually entail a function of the infection level in the visited country and the airline passenger volume between the two involved geographical locations, as described by Quam and Wilder-Smith [11]. Access to accurate and appropriate data sets describing passenger flow between locations is crucial when developing transmission models of global spread [12]; such models can explore the potential role the airline network may play in the spread of disease, but also predict future spread, particularly when new threats emerge. However, a variety of data sources have been used, leading to inconsistency and incomparability between modelling studies [7]. The sources themselves are generally not designed for epidemic modelling purposes. They include data for use within the aviation industry, which may be expensive to access and impose user restrictions, including a prohibition on sharing with a third party [7,12].
Open-access data sources do exist but may be geographically restricted, provide information in forms not easily convertible into passenger numbers or are limited in temporal resolution [7].
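As a rough illustration of the kind of function described above (infection level in the visited country combined with passenger volume between two locations), the following sketch assumes that each passenger is infected independently with a probability equal to the point prevalence at the origin. The function names, the independence assumption and the example numbers are illustrative assumptions and are not taken from any of the reviewed models.

```python
def expected_importations(prevalence: float, passengers: int) -> float:
    """Expected number of infected arrivals.

    Assumes (for illustration only) that every passenger is infected
    independently with probability `prevalence`.
    """
    if not 0.0 <= prevalence <= 1.0:
        raise ValueError("prevalence must lie in [0, 1]")
    if passengers < 0:
        raise ValueError("passengers must be non-negative")
    return prevalence * passengers


def prob_at_least_one_importation(prevalence: float, passengers: int) -> float:
    """Probability that at least one infected passenger arrives,
    under the same independence assumption."""
    if not 0.0 <= prevalence <= 1.0:
        raise ValueError("prevalence must lie in [0, 1]")
    if passengers < 0:
        raise ValueError("passengers must be non-negative")
    return 1.0 - (1.0 - prevalence) ** passengers


if __name__ == "__main__":
    # Hypothetical inputs: 1 in 10,000 travellers infected, 5,000 passengers.
    print(expected_importations(1e-4, 5000))
    print(prob_at_least_one_importation(1e-4, 5000))
```

Because both outputs scale directly with the passenger count, any bias in the passenger data carries straight through to the estimated risk.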
To gain an overview of the range of airline passenger data sources used by modelling studies, a systematic literature review was designed and conducted. The principal aim of the review was to determine the data types (e.g. passenger numbers and seat capacity) and sources used for the purposes of modelling international infectious disease importation. A secondary aim of the review was to assess the reproducibility of those studies regarding sourcing and use of airline passenger data.

Table 1b
Systematic review on airline passenger data in infectious disease modelling, (A) fields recorded and (B) criteria used to determine reproducibility of articles and sources
If multiple sources were used in an article, the average score was calculated.
Diio: data in, intelligence out; IATA: International Air Transport Association; MERS: Middle East respiratory syndrome; OAG: company providing air travel data; UNWTO: World Tourism Organization.
a If studies used a third party's travel model and if they did not describe the model fully but provide a link or citation, we assessed the cited external documentation for reproducibility.
b Only material using open source data contributes +1 point to the reproducibility score.
c The material must receive a 'yes' for all subvariables for this variable to contribute +1 point to the reproducibility score.

Search strategy
We conducted a search of the literature on 2 October 2017 using PubMed, Web of Science and Scopus with no restriction on the earliest date of the articles returned. A combination of three sets of search terms was used in this review (#1 AND #2 AND #3). The first set (#1) was: 'air' OR 'airline' OR 'aviation' OR 'flight' OR 'airport' OR 'passenger' OR 'transport*' OR 'travel*' AND NOT 'pollution'. The second set (#2) was: 'epidemic' OR 'pandemic'. The final set (#3) was: 'global' OR 'international'. The term 'pollution' was classed as an exclusionary term as initial scoping suggested that a large proportion of results included pollution studies, which were deemed irrelevant to this review.
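The combination of the three term sets can be sketched as a generic Boolean string. This is only an illustration of the logic (#1 AND #2 AND #3): the helper name `or_group` is hypothetical, and the exact syntax submitted to PubMed, Web of Science and Scopus is not given in the text and differs between databases.

```python
def or_group(terms):
    """Join quoted terms with OR and wrap them in parentheses."""
    return "(" + " OR ".join(f"'{t}'" for t in terms) + ")"


# Term sets as listed in the search strategy.
SET_1 = ["air", "airline", "aviation", "flight", "airport",
         "passenger", "transport*", "travel*"]
SET_2 = ["epidemic", "pandemic"]
SET_3 = ["global", "international"]
EXCLUDED = "pollution"

# Generic Boolean form; real database syntax will differ.
query = (
    f"({or_group(SET_1)} AND NOT '{EXCLUDED}')"
    f" AND {or_group(SET_2)}"
    f" AND {or_group(SET_3)}"
)

if __name__ == "__main__":
    print(query)
```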
We included articles if they matched the following inclusion criteria: (i) they were primary and peer-reviewed research; (ii) they modelled the international spread of human infectious diseases between at least two countries and (iii) the model was parameterised with airline passenger data. We included modelling studies which considered either dynamic models of the transmission process or non-dynamic modelling of the movement of infected individuals. We also permitted the inclusion of any additional articles if they were identified as the source of passenger data used in already selected articles and met the three inclusion criteria above. Although no language restriction was applied to the searches, articles in a language other than English were excluded during the abstract review if no translated version of the abstract could be found. Review articles not containing primary research were also excluded, unless they addressed specifically the use of airline passenger data in epidemic modelling. Articles for which an abstract could not be accessed were excluded at this stage.

Table 2c
Systematic review on airline passenger data in infectious disease modelling, list of selected articles with name of data source, information on data validation and reproducibility score (n = 136)
Following deduplication, the full list of abstracts and titles was reviewed and included or excluded by at least two reviewers independently. Any disagreement regarding inclusion of an article in the review was then discussed between all reviewers. The full text of selected articles was accessed and screened for relevance in more detail. Articles whose full text was neither open access nor available through the University of Liverpool or Lancaster University library subscriptions were excluded. The bibliographies of the selected articles were searched for additional relevant articles, based on title and full text, subject to the same inclusion and exclusion criteria.

Data collection strategy
From the final selection of articles, we extracted information regarding the airline passenger data used in each article (Table 1). This information focused on the source, type and validity of data used in the study (Table 1, part A) and the reproducibility of data usage judged by pre-defined criteria (Table 1, part B). For the purposes of this review, data validation was defined as the comparison of primary data used in an article against at least one independent and appropriately comparable set of data. An article was deemed to have validated its data source if it cited another independent and comparable data set and contained a comparison between them. To determine reproducibility, each article was assessed for its reporting of data source using the checklist shown in Table 1, part B and scored accordingly. We did not plan or conduct any bias analysis of the selected publications.
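The scoring logic can be sketched as follows. The criterion names below are hypothetical placeholders, since the actual checklist items are those in Table 1, part B. The sketch only encodes the rules stated in the table notes (one point per criterion met, a criterion with subvariables counting only if all are 'yes', and averaging across sources) together with the threshold of 3 or above that the Results use for partial reproducibility.

```python
from statistics import mean

PARTIAL_THRESHOLD = 3  # "score of 3 or above" as used in the Results


def source_score(checklist: dict) -> int:
    """Score one data source: +1 per criterion met.

    A criterion given as a dict of subvariables counts only if every
    subvariable is True.
    """
    score = 0
    for answer in checklist.values():
        if isinstance(answer, dict):
            met = all(answer.values())
        else:
            met = bool(answer)
        score += int(met)
    return score


def article_score(sources: list) -> float:
    """Average the per-source scores when an article uses several sources."""
    return mean(source_score(s) for s in sources)


# Hypothetical checklists for an article that used two data sources.
source_a = {
    "source_named": True,
    "open_source_data": False,
    "access_date_given": True,
    "processing_described": {"aggregation": True, "time_period": False},
}
source_b = {
    "source_named": True,
    "open_source_data": True,
    "access_date_given": True,
    "processing_described": {"aggregation": True, "time_period": True},
}

if __name__ == "__main__":
    score = article_score([source_a, source_b])
    print(score, score >= PARTIAL_THRESHOLD)
```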

Results
From the 4,012 articles identified in the search, 2,547 were identified as duplicates and rejected, resulting in 1,465 articles which went forward for title and abstract screening (Figure). A further 1,130 were rejected at this stage as they did not meet the inclusion criteria. A total of 335 articles were selected based on their title and abstract and read in full. From these, 223 were rejected: the majority (n = 87) did not contain airline data, 73 were deemed not relevant (did not contain at least two required criteria, such as airline data and model) and 20 used no model. An additional 19 were country-specific, 17 were inaccessible (no access to journal or language barrier), five were reviews and two were not focused on human disease movement. After reading the articles in full, 112 were selected as relevant to this review. Finally, 24 additional articles, not detected by the search but through reading the bibliography of accepted articles, were included after being read in full to determine relevance.
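The selection counts reported above can be checked with a few lines of arithmetic. The variable names are ours; all numbers are those given in the text.

```python
# Counts at each stage of article selection, as reported in the text.
identified = 4012
duplicates = 2547
screened = identified - duplicates            # title and abstract screening

rejected_on_abstract = 1130
read_in_full = screened - rejected_on_abstract

# Reasons for rejection after full-text reading.
full_text_rejections = {
    "no airline data": 87,
    "not relevant": 73,
    "no model": 20,
    "country-specific": 19,
    "inaccessible": 17,
    "review": 5,
    "not human disease movement": 2,
}
rejected_in_full = sum(full_text_rejections.values())
selected_from_search = read_in_full - rejected_in_full

from_bibliographies = 24
final_selection = selected_from_search + from_bibliographies

if __name__ == "__main__":
    print(screened, read_in_full, rejected_in_full,
          selected_from_search, final_selection)
```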
The publication year of the 136 articles selected ranged from 1985 to 2017, with the largest number of articles (n = 17) published in 2016 (Table 2). In the 20 years following the publication by Rvachev and Longini in 1985, the oldest article relevant to this review, only seven relevant articles were published [13-19].
A wide range of data sources have been used for modelling passenger flow between countries; in total 45 distinct sources were identified (Table 3). Commercial or industry data sources were most often used (14 sources, used in 131 articles), followed by governmental data (14 sources, used in 30 articles). Of the commercial data sources, those most often acknowledged were from the International Air Transport Association (IATA) (61 articles) and OAG, an airline industry company specialising in data provision and analysis (38 articles). Some articles used the airline data directly; however, two articles [17,20] used data from one or more articles (see Table 2) and were therefore also considered to have used industry data. Where a database was named from IATA or OAG sources, OAG MAX was the most common (5 articles). A range of other industry-orientated data sources were cited, including Diio (airline market information), Amadeus (travel reservations database), Feeyo (a Chinese flight scheduler) and OpenFlights.org (an open-access database of flight records contributed by members of the public). Four articles used passenger surveys such as TravelPac from the United Kingdom's (UK) Office for National Statistics (ONS), and nine articles used tourism surveys (Table 3). Eleven articles used information published by airports, and four other sources were reported (the social media site Twitter, two aircraft manufacturers and EuroStat).
According to the set of standards we had established to determine an article's reproducibility (see Table 1, part B), no article was considered fully reproducible. Eight (6%) articles were deemed partially reproducible (score of 3 or above), where some information regarding the description and use of passenger data was reported [13,50-56]. Of the 45 total data sources identified, 26 were open source, 11 were closed source, and 8 were not publicly available. The date range of the

Table 3a
Systematic review on airline passenger data in infectious disease modelling, data sources identified in the selected articles, grouped by sector (n = 136 articles)
Data source (number of uses; percentage of total uses of any data source) | Number of articles using data source a | Reference(s)
The majority of articles (n = 115; 85%) were concerned with the global spread of infectious diseases, while the analysis of the airline network itself (while modelling pathogen spread) was the next most common purpose (n = 11; 8%). Five articles used passenger data for descriptive or illustrative purposes [13,29,30,60,61], two articles used the data for passenger screening simulations [62,63] and two articles described the development of a public health tool [23,64]. Of the pathogens modelled, pandemic influenza was the most frequent subject of the models (n = 40; 29%) (Table 5). Generic models not focussing on a specific pathogen were also common (n = 23; 17%).

Discussion
The purpose of this review was to assess the source and usage of airline passenger data used in mathematical models of international infectious disease spread. A total of 136 articles met the inclusion criteria, from which we identified 45 unique data sources.
The majority of these were sources provided on a commercial basis, e.g. IATA, OAG and the International Civil Aviation Organization (ICAO). These commercial sources provide information from the aviation industry for use within that industry and are marketed as being detailed and accurate. The data resolution can be high: for example, passenger data are available stratified by route (including stopovers), fare class, point of origin and time period. There are often restrictions on the use of the data, in particular non-disclosure agreements regarding the data, collection and retrieval methods, and financial charges apply for access [7]. This type of data is essentially closed data: publicly available but with restricted access. Furthermore, the methodology underpinning data collection is generally undisclosed and it is therefore difficult for researchers to assess the quality, representativeness and biases of the data. Although these data sources may have a number of subsets representing different data types, authors rarely provide more accurate reporting of the data sets, including name of subsets used and date of access,

A number of data sources identified in the review were open-access and include aggregate numbers of passengers published by individual airports, data compiled and released by government agencies (e.g. the UK Office for National Statistics) and information derived from tourism surveys. Although freely available to access, these data sets may not provide the resolution of information required by modelling studies as they are typically limited to passengers departing from or arriving at a specific geographical region or are aggregated over long time periods (annual or quarterly data). In addition, the collection methodology is not always reported for such data sources and there may be biases in the data, particularly where reporting is voluntary. Combining information from such sources represents a considerable data challenge.
International travel data describing direct flights only were used more often than those with full itinerary information. Data based on direct flights exclude information on connecting passengers and will therefore underestimate the number of passengers travelling to a specific destination. This limitation is likely to introduce bias, underestimating passenger flow between distant or poorly served locations and overestimating passengers travelling shorter distances [65]. This bias has implications for public health planning as some locations or countries may have an apparent lower risk of importation events because of the lack of direct flights from putative infecting source countries. This may explain the discrepancy during the Ebola epidemic in West Africa in 2014 and 2015, where several studies suggested that the United States (US) was at relatively low risk of importation following the suspension of direct flights. The US did, however, receive two importations through air travel from the affected area: one was due to a passenger reaching their final destination through indirect flights and the second was a returning healthcare worker [25,66,67].
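The direction of this bias can be seen in a toy example. The three-airport network, the passenger numbers and the function names below are invented for illustration and do not come from any reviewed data source. Counting passengers per flight segment assigns connecting travellers to the short legs they fly, and shows no flow at all between locations without a direct flight.

```python
from collections import Counter

# Hypothetical itineraries: (sequence of airports, number of passengers).
itineraries = [
    (("A", "B"), 100),       # direct A to B
    (("A", "B", "C"), 40),   # A to C, connecting through B
    (("B", "C"), 60),        # direct B to C
]


def origin_destination_flows(itins):
    """True origin-to-final-destination passenger counts."""
    flows = Counter()
    for route, passengers in itins:
        flows[(route[0], route[-1])] += passengers
    return flows


def segment_flows(itins):
    """Passenger counts per direct flight segment, as seen in data
    that describe direct flights only."""
    flows = Counter()
    for route, passengers in itins:
        for leg in zip(route, route[1:]):
            flows[leg] += passengers
    return flows


if __name__ == "__main__":
    true_flows = origin_destination_flows(itineraries)
    direct_only = segment_flows(itineraries)
    for pair in sorted(set(true_flows) | set(direct_only)):
        print(pair, "true:", true_flows[pair], "direct-only:", direct_only[pair])
```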
When considering international travel patterns for public health purposes, accessing information on the number of passengers travelling from an origin to a destination is the most relevant. However, we found that several articles used data for which the unit of measurement was not number of passengers but described passenger traffic in terms of seat capacity (the number of seats on aircraft flying between two specific airports), for which assumptions must be made regarding how full individual flights are and how this may or may not vary with season. In addition, this data type cannot take into account the full routing of a passenger and this information must therefore be inferred from the data or the study needs to state that only direct flights were considered. The variety of data types used for epidemic modelling purposes perhaps reflects the lack of a widely accepted and accessible data source, and this variation in data unit could lead to differences in the conclusions between modelling studies.
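To make the seat-capacity assumption concrete, the sketch below converts seats into an estimated passenger count using an assumed load factor (the fraction of seats occupied). The load-factor values and the seat count are hypothetical; the point is only that the estimate, and anything derived from it, moves in direct proportion to this assumption.

```python
def estimate_passengers(seats: int, load_factor: float) -> float:
    """Estimate passengers from seat capacity.

    `load_factor` is the assumed fraction of seats occupied; it is an
    assumption supplied by the modeller, not something in the capacity data.
    """
    if seats < 0:
        raise ValueError("seats must be non-negative")
    if not 0.0 <= load_factor <= 1.0:
        raise ValueError("load_factor must lie in [0, 1]")
    return seats * load_factor


if __name__ == "__main__":
    monthly_seats = 10_000  # hypothetical seat capacity on one route
    # Hypothetical seasonal load factors.
    for season, load_factor in {"low season": 0.6, "high season": 0.9}.items():
        print(season, estimate_passengers(monthly_seats, load_factor))
```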
For example, if the authors are considering passengers departing from or arriving in cities rather than airports but the data were collected at the airport level, the aggregation of passenger numbers from each airport to the city should be acknowledged by the authors. For additional clarity, it would be useful if the authors reported the stage at which the data were aggregated to city level, whether this was part of the original data, or if this was a data manipulation done by the authors. At the time of writing of this review, there was limited understanding of the sensitivity of this level of data (city level) and how it compares to airport-level data and other aggregated data sets, requiring further analytical work. Overall, the majority of articles were deemed to have methods that were not reproducible, and while eight studies were deemed partially reproducible, none were considered to be fully reproducible. It is incumbent on authors to ensure accurate reporting for all aspects of their methodology; our findings suggest that authors of international disease modelling studies should aim to improve their reporting of source and usage of airline passenger data. We advise authors to reference the fields reported in Table 1, part B, at a minimum, when using any data sets.
Data validation is often required to ensure that the collected data are free from biases and an accurate reflection of the subject or process they describe. For airline passenger data, validation is particularly important if the passenger data are sourced from a commercial company with limited or no collection methodology disclosed. Only seven articles reported validation with at least one independent or appropriately comparable set of observations. While there is no acknowledged gold standard data set, governmental open source data, such as those from the US Department of Transportation or TravelPac, do at least have published methodology on which potential biases may be identified.
Many pathogens can be relocated through human movement to populations where susceptibility or a lack of awareness may afford a greater incidence and persistence. Most articles reviewed, where a specific pathogen was considered, investigated transmission or importation of viruses. Only three articles were focused on bacteria (Vibrio cholerae, Clostridium difficile and Salmonella enterica serotypes Typhi and Paratyphi), despite the known importance of international travel for the global dissemination of antibacterial resistance [70,71] and the capacity of bacteria to initiate epidemics following importation, e.g. the cholera outbreak in Haiti in 2010 [72]. Pandemic influenza was the disease most often considered by the reviewed articles, which perhaps reflects the global significance of pandemic events and the ease with which pandemic strains have spread historically. The other non-influenza viruses noted in these studies have all initiated outbreaks following introduction through international travel. Outbreaks following introduction occurred in South Korea with MERS-CoV [73], in the Portuguese islands of Madeira (off the coast of Western Africa) with dengue virus [74] and in the Caribbean (leading to imported cases in the US) and Italy with chikungunya virus [75,76]. Finally, the accurate modelling of importation risks for specific pathogens may require very high-resolution passenger data, particularly where routes are indirect and the total travel time from origin to destination is important for screening, taking incubation periods into account [77].
To the best of our knowledge, direct comparisons of commercial with open-access data sets, or between commercial data sets, have not yet been carried out, preventing an informed decision on which data sets are more suitable to represent airline passengers. Although a direct comparison between commercial data sets is likely to be informative for the modelling community, it is also likely to be expensive. In addition, a single data set that is agreed by the community to be the best representation of international (and national) airline passenger flow would be ideal, although it may be difficult to realise given the proprietary restrictions of certain data sets. The field should aspire to collaborate with industrial data providers to make accurate passenger data available for research, particularly during global public health emergencies.

Strengths and limitations of the review
The screening and selection of articles was done in a systematic manner and by two independent reviewers to ensure all relevant articles were included in the selection of articles to be read in full. The full reference lists of accepted articles were read to find additional relevant articles. Although a number of articles were found when going through reference lists, we are confident that this selection was a good representation of the range of airline data used. In addition, no other review that we are aware of is focused on the analysis of the validity and reproducibility of the data used for mathematical models of infectious disease spread by air travel. Limitations of this study include not contacting authors regarding their methods and not including other search engines which may have yielded additional articles but would also have returned a very large number of potential articles to process. In addition, by limiting the articles to international spread only, some articles which focused primarily on spread within a country were excluded, even though they may include relevant data sources.

Conclusion
We conducted a systematic review to assess the range and reporting of data used by authors to model the international spread of infectious diseases through the airline network. We found 136 articles matching our inclusion criteria and extracted information regarding source, data type, validation assessment and reproducibility. We found a variety of data sources and types used, limited validation performed and poor reporting, rendering many studies unreproducible. We recommend that greater effort is devoted to validation of data sources and that a consensus is achieved on the use of information sources providing airline passenger data. Public health modelling would benefit greatly from the availability of a validated contemporary open-source data source which includes detailed origin-destination information, including connecting passengers, and has high temporal resolution.

*Note added in proof
During editing following acceptance, the authors became aware of a further four articles that satisfied inclusion criteria but were not discoverable using the search algorithm ( (3):222-9). The articles all utilised a previously identified data source (Turism.se) and modelled the travel-related risk of campylobacteriosis, giardiasis, salmonellosis and shigellosis infection. This omission affects quantitative elements of Tables 3 to 5, but does not affect our results and conclusions regarding data sources, nor our overall conclusions and recommendations.