Science-Watching: Forecasting New Diseases in Low-Data Settings Using Transfer Learning

[from London Mathematical Laboratory]

by Kirstin Roster, Colm Connaughton & Francisco A. Rodrigues

Abstract

Recent infectious disease outbreaks, such as the COVID-19 pandemic and the Zika epidemic in Brazil, have demonstrated both the importance and difficulty of accurately forecasting novel infectious diseases. When new diseases first emerge, we have little knowledge of the transmission process, the level and duration of immunity to reinfection, or other parameters required to build realistic epidemiological models. Time series forecasts and machine learning, while less reliant on assumptions about the disease, require large amounts of data that are also not available in early stages of an outbreak. In this study, we examine how knowledge of related diseases can help make predictions of new diseases in data-scarce environments using transfer learning. We implement both an empirical and a synthetic approach. Using data from Brazil, we compare how well different machine learning models transfer knowledge between two different dataset pairs: case counts of (i) dengue and Zika, and (ii) influenza and COVID-19. In the synthetic analysis, we generate data with an SIR model using different transmission and recovery rates, and then compare the effectiveness of different transfer learning methods. We find that transfer learning offers the potential to improve predictions, even beyond a model based on data from the target disease, though the appropriate source disease must be chosen carefully. While imperfect, these models offer an additional input for decision makers for pandemic response.

Introduction

Epidemic models can be divided into two broad categories: data-driven models aim to fit an epidemic curve to past data in order to make predictions about the future; mechanistic models simulate scenarios based on different underlying assumptions, such as varying contact rates or vaccine effectiveness. Both model types aid in the public health response: forecasts serve as an early warning system of an outbreak in the near future, while mechanistic models help us better understand the causes of spread and potential remedial interventions to prevent further infections. Many different data-driven and mechanistic models were proposed during the early stages of the COVID-19 pandemic and informed decision-making with varying levels of success. This range of predictive performance underscores both the difficulty and importance of epidemic forecasting, especially early in an outbreak. Yet the COVID-19 pandemic also led to unprecedented levels of data-sharing and collaboration across disciplines, so that several novel approaches to epidemic forecasting continue to be explored, including models that incorporate machine learning and real-time big data streams. In addition to the COVID-19 pandemic, recent infectious disease outbreaks include Zika virus in Brazil in 2015, Ebola virus in West Africa in 2014–16, Middle East respiratory syndrome (MERS) in 2012, and coronavirus associated with severe acute respiratory syndrome (SARS-CoV) in 2003. This trajectory suggests that further improvements to epidemic forecasting will be important for global public health. Exploring the value of new methodologies can help broaden the modeler’s toolkit to prepare for the next outbreak. In this study, we consider the role of transfer learning for pandemic response.

Transfer learning refers to a collection of techniques that apply knowledge from one prediction problem to solve another, often using machine learning and with many recent applications in domains such as computer vision and natural language processing. Transfer learning leverages a model trained to execute a particular task in a particular domain, in order to perform a different task or extrapolate to a different domain. This allows the model to learn the new task with less data than would normally be required, and is therefore well-suited to data-scarce prediction problems. The underlying idea is that skills developed in one task, for example the features that are relevant to recognize human faces in images, may be useful in other situations, such as classification of emotions from facial expressions. Similarly, there may be shared features in the patterns of observed cases among similar diseases.

The value of transfer learning for the study of infectious diseases is relatively under-explored. The majority of existing studies on diseases remain in the domain of computer vision and leverage pre-trained neural networks to diagnose conditions such as retinal diseases, dental diseases, or COVID-19 from medical images. Coelho and colleagues (2020) explore the potential of transfer learning for disease forecasts. They train a Long Short-Term Memory (LSTM) neural network on dengue fever time series and make forecasts directly for two other mosquito-borne diseases, Zika and Chikungunya, in two Brazilian cities. Even without any data on the two target diseases, their model achieves high prediction accuracy four weeks ahead. Gautam (2021) uses COVID-19 data from Italy and the USA to build an LSTM transfer model that predicts COVID-19 cases in countries that experienced a later pandemic onset.

These studies provide empirical evidence that transfer learning may be a valuable tool for epidemic forecasting in low-data situations, though research is still limited. In this study, we aim to contribute to this empirical literature not only by comparing different types of knowledge transfer and forecasting algorithms, but also by considering two different pairs of endemic and novel diseases observed in Brazilian cities, specifically (i) dengue and Zika, and (ii) influenza and COVID-19. With an additional analysis on simulated time series, we hope to provide theoretical guidance on the selection of appropriate disease pairs, by better understanding how different characteristics of the source and target diseases affect the viability of transfer learning.
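
The abstract states that the synthetic series are generated with an SIR model under different transmission and recovery rates. As a rough illustration only, the sketch below produces one "source" and one "target" case-count series from a deterministic, discrete-time SIR model. The function name, the population size, the step count, and the specific rate values are all assumptions made for this example and are not taken from the paper.

```python
def simulate_sir(beta, gamma, population=10_000, initial_infected=10, steps=100):
    """Return new infections per time step from a deterministic SIR model.

    beta is the transmission rate and gamma the recovery rate (both per step).
    The update is a simple forward-Euler step of size 1; all default values
    are illustrative assumptions, not the paper's settings.
    """
    susceptible = float(population - initial_infected)
    infected = float(initial_infected)
    new_cases = []
    for _ in range(steps):
        # New infections are proportional to contacts between S and I,
        # capped so the susceptible pool never becomes negative.
        infections = min(beta * susceptible * infected / population, susceptible)
        recoveries = gamma * infected
        susceptible -= infections
        infected += infections - recoveries
        new_cases.append(infections)
    return new_cases


# Hypothetical "source" and "target" diseases that differ only in their rates.
source_cases = simulate_sir(beta=0.3, gamma=0.1)
target_cases = simulate_sir(beta=0.5, gamma=0.2)
```

Varying beta and gamma in this way yields pairs of epidemic curves with the same qualitative shape but different timing and size, which is the kind of controlled difference between source and target that a synthetic analysis can explore.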

Zika and COVID-19 are two recent examples of novel emerging diseases. Brazil experienced a Zika epidemic in 2015–16, and the WHO declared a public health emergency of international concern in February 2016. Zika is caused by an arbovirus spread primarily by mosquitoes, though other transmission routes, including congenital and sexual transmission, have also been observed. Zika belongs to the family of viral hemorrhagic fevers and symptoms of infection share some commonalities with other mosquito-borne arboviruses, such as yellow fever, dengue fever, or chikungunya. Illness tends to be asymptomatic or mild but can lead to complications, including microcephaly and other brain defects in the case of congenital transmission.

Given the similarity of the pathogen and primary transmission route, dengue fever is an appropriate choice of source disease for Zika forecasting. Not only does the shared mosquito vector result in similar seasonal patterns of annual outbreaks, but consistent, geographically and temporally granular data on dengue cases is available publicly via the open data initiative of the Brazilian government.

COVID-19 is an acute respiratory infection caused by the novel coronavirus SARS-CoV-2, which was first detected in Wuhan, China, in 2019. It is transmitted directly between humans via airborne respiratory droplets and particles. Symptoms range from mild to severe and may affect the respiratory tract and central nervous system. Several variants of the virus have emerged, which differ in their severity, transmissibility, and level of immune evasion.

Influenza is also a contagious respiratory disease that is spread primarily via respiratory droplets. Infection with the influenza virus also follows patterns of human contact and seasonality. Two types of influenza virus (A and B) cause seasonal epidemics, and new strains of each type emerge regularly. Given the similarity in transmission routes and, to a lesser extent, in clinical manifestations, influenza is chosen as the source disease for knowledge transfer to model COVID-19.

For each of these disease pairs, we collect time series data from Brazilian cities. Data on the target disease from half the cities is retained for testing. To ensure comparability, the test set is the same for all models. Using this empirical data, as well as the simulated time series, we implement the following transfer models to make predictions.

  • Random forest: First, we implement a random forest model, which was recently found to capture the time series characteristics of dengue in Brazil well. We use this model to make predictions for Zika without re-training. We also train a random forest model on influenza data to make predictions for COVID-19. This is a direct transfer method, where models are trained only on data from the source disease.
  • Random forest with TrAdaBoost: We then incorporate data from the target disease (i.e., Zika and COVID-19) using the TrAdaBoost algorithm together with the random forest model. This is an instance-based transfer learning method, which selects relevant examples from the source disease to improve predictions on the target disease.
  • Neural network: The second machine learning algorithm we deploy is a feed-forward neural network, which is first trained on data of the endemic disease (dengue/influenza) and applied directly to forecast the new disease.
  • Neural network with re-training and fine-tuning: We then retrain only the last layer of the neural network using data from the new disease and make predictions on the test set. Finally, we fine-tune all the layers’ parameters using a small learning rate and a small number of epochs. These models are examples of parameter-based transfer methods, since they leverage the weights generated by the source disease model to accelerate and improve learning in the target disease model.
  • Aspirational baseline: We compare these transfer methods to a model trained only on the target disease (Zika/COVID-19) without any data on the source disease. Specifically, we use half the cities in the target dataset for training and the other half for testing. This gives a benchmark of the performance in a large-data scenario, which would occur after a longer period of disease surveillance.
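
The parameter-based transfer steps in the list above (train on the source disease, retrain only the last layer on the target disease, then fine-tune all layers with a small learning rate and few epochs) can be sketched with a tiny feed-forward network in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the network size, the lag-window features, the sine-wave toy series standing in for source and target case counts, and every hyperparameter are hypothetical choices made for this example.

```python
import copy
import math
import random


def make_windows(series, lags=3):
    """Turn a series into (previous `lags` values, next value) pairs."""
    return [(series[t - lags:t], series[t]) for t in range(lags, len(series))]


def init_network(n_inputs, n_hidden, seed=0):
    rng = random.Random(seed)
    return {
        "w1": [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
               for _ in range(n_hidden)],
        "b1": [0.0] * n_hidden,
        "w2": [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)],
        "b2": 0.0,
    }


def forward(net, x):
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(net["w1"], net["b1"])]
    output = sum(w * h for w, h in zip(net["w2"], hidden)) + net["b2"]
    return hidden, output


def train(net, pairs, lr, epochs, update_hidden=True):
    """Plain SGD on squared error; update_hidden=False freezes the first layer."""
    for _ in range(epochs):
        for x, y in pairs:
            hidden, output = forward(net, x)
            error = output - y
            if update_hidden:
                # Hidden-layer gradients use the output weights before their update.
                for j, h in enumerate(hidden):
                    grad_h = error * net["w2"][j] * (1.0 - h * h)
                    for k, xk in enumerate(x):
                        net["w1"][j][k] -= lr * grad_h * xk
                    net["b1"][j] -= lr * grad_h
            for j, h in enumerate(hidden):
                net["w2"][j] -= lr * error * h
            net["b2"] -= lr * error
    return net


def mse(net, pairs):
    return sum((forward(net, x)[1] - y) ** 2 for x, y in pairs) / len(pairs)


# Toy seasonal series (assumed data): a long source and a short target history.
source = [0.5 + 0.4 * math.sin(2 * math.pi * t / 26) for t in range(104)]
target = [0.5 + 0.3 * math.sin(2 * math.pi * t / 26 + 0.5) for t in range(30)]
source_pairs = make_windows(source)
target_pairs = make_windows(target)
target_train, target_test = target_pairs[:10], target_pairs[10:]

# Step 1: train every layer on the source disease (direct transfer model).
source_net = train(init_network(3, 4), source_pairs, lr=0.05, epochs=200)

# Step 2: retrain only the output layer on the few target observations.
retrained_net = train(copy.deepcopy(source_net), target_train,
                      lr=0.05, epochs=50, update_hidden=False)

# Step 3: fine-tune all layers with a small learning rate and few epochs.
fine_tuned_net = train(copy.deepcopy(retrained_net), target_train,
                       lr=0.005, epochs=10)

errors = {name: mse(net, target_test)
          for name, net in [("direct", source_net),
                            ("retrained", retrained_net),
                            ("fine_tuned", fine_tuned_net)]}
```

Comparing the entries of `errors` on the held-out target windows mirrors, in miniature, the comparison the study makes across transfer methods on a common test set. Whether re-training or fine-tuning helps in this toy setting depends on the assumed series and hyperparameters.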

The remainder of this paper is organized as follows. The models are described in more technical detail in Section 2. Section 3 shows the results of the synthetic and empirical predictions. Finally, Section 4 discusses practical implications of the analyses.

Access the full paper [via institutional access or paid download].