Volume 30, Number 9—September 2024
Research Letter
Optimizing Disease Outbreak Forecast Ensembles
Abstract
On the basis of historical influenza and COVID-19 forecasts, we found that more than 3 forecast models are needed to ensure robust ensemble accuracy. Additional models can improve ensemble performance, but with diminishing accuracy returns. This understanding will assist with the design of current and future collaborative infectious disease forecasting efforts.
Real-time collaborative forecast efforts have become the standard for generating and evaluating forecasts of infectious disease outbreaks (1,2). Individual forecasts are aggregated into an ensemble prediction that has historically outperformed individual models and serves as the primary forecast communicated externally (3–5). Because of the focus on the singular ensemble model and the costs associated with producing individual forecasts, public health officials starting or maintaining a forecast hub face 2 key challenges: identifying target participation rates and optimizing ensemble performance of participating models. To guide this decision-making, we analyzed data from recent US-based collaborative outbreak forecast hubs to identify how the size and composition of an ensemble influence performance.
We analyzed hub forecasts for influenza-like illness (ILI) from 2010–2017 (5); for COVID-19 reported cases, hospital admissions, and deaths from 2020–2023 (6); and for influenza hospital admissions from 2021–2023 (7). For each hub, we identified time periods with maximal model participation that spanned at least 2 increasing and 2 decreasing epidemiologic phases and obtained forecasts for individual models that produced >90% of all possible forecasts (Appendix Table 1, Figure 1). For each ensemble size n_D ∈ {1, …, N_D}, where N_D is the disease-specific total number of models meeting our inclusion criteria, we created unweighted ensemble forecasts for every combination of individual models of size n_D. Following the hub forecast methodologies, we made probabilistic forecasts for ILI by using a linear pool methodology (5), and we made quantile forecasts for all other targets by taking the median across all individual forecasts at each quantile level (Figure 1) (8). For each hub, we compared ensemble performance against 2 hub-produced models: first, a baseline model that produces naive forecasts and serves as a skill reference point; and second, the published real-time ensemble, an unweighted ensemble of all submitted forecasts that is the current standard for performance (3,5). We summarized probabilistic ensemble forecast skill by using the log score for ILI forecasts and the weighted interval score for all others (9,10). We took the reciprocal of the log score so that lower values indicate better performance, consistent with the weighted interval score (Appendix).
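As an illustration of the ensemble-construction step, the following minimal sketch builds every unweighted quantile-median ensemble of size n_D from individual quantile forecasts. It is not the hubs' actual code; the model names, quantile levels, and data layout are hypothetical.

```python
# Minimal sketch of unweighted quantile-median ensembling (assumed data
# layout; not taken from the hubs' codebases). Each model's forecast is an
# array of predicted values at the same quantile levels for one target.
from itertools import combinations

import numpy as np

# Hypothetical quantile forecasts (e.g., the 0.25, 0.50, 0.75 quantiles).
forecasts = {
    "model_a": np.array([5.0, 10.0, 15.0]),
    "model_b": np.array([6.0, 11.0, 17.0]),
    "model_c": np.array([4.0, 9.0, 14.0]),
    "model_d": np.array([7.0, 12.0, 16.0]),
}

def median_ensemble(members):
    """Combine member forecasts by taking the median across models at
    each quantile level, the approach used for the quantile-based hubs."""
    return np.median(np.vstack([forecasts[m] for m in members]), axis=0)

# Enumerate every combination of individual models of size n_D.
n_D = 2
for members in combinations(sorted(forecasts), n_D):
    print(members, median_ensemble(members))
```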
Looking across all ensemble sizes and combinations, we found that including more models improved average forecast performance and that all ensembles composed of >3 models outperformed the baseline model (Figure 2). Further increases to the ensemble size only slightly improved average forecast performance but substantially decreased the variability of performance across ensembles. For example, when we increased the ensemble size of influenza hospital admission forecasts from 4 to 7, average performance improved by only 2%, whereas the interquartile range of performance decreased by 56.5%. Increasing the ensemble size therefore chiefly reduces the variability in the expected performance of an ensemble.
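This diminishing-returns pattern can be summarized by grouping ensemble scores by size and comparing the mean score with its interquartile range, as in the sketch below. The scores here are synthetic stand-ins, not the study's data; lower values indicate better skill.

```python
# Sketch of summarizing skill by ensemble size (synthetic scores, not the
# study's data; lower weighted interval score = better skill).
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical scores for all ensembles of each size: mean skill improves
# slowly with size while the spread across combinations shrinks quickly.
scores_by_size = {
    size: rng.normal(loc=10.0 - 0.3 * size, scale=3.0 / size, size=100)
    for size in range(1, 8)
}

for size, scores in scores_by_size.items():
    q1, q3 = np.percentile(scores, [25, 75])
    print(f"size {size}: mean={scores.mean():.2f}, IQR={q3 - q1:.2f}")
```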
To assist with decision-making regarding optimal ensemble assembly, we tested 2 approaches for model selection on the basis of past performance: we either ranked models by their individual performance and chose the top n_D models (individual rank), or we compared the performance of all ensemble combinations of size n_D and chose the models from the top-performing ensemble (ensemble rank). Across all hubs, the individual rank methodology outperformed randomly assembled ensembles of the same size 63% (range 33.1%–87.2%) of the time, and the ensemble rank methodology outperformed randomly assembled ensembles of the same size 87.9% (range 70.9%–99.7%) of the time (Appendix Table 2, Figure 2). Performance of those ensembles was similar during both the training and testing periods, suggesting that ensemble performance is consistent through time (Appendix Figures 2, 3). Overall, ensemble rank outperformed individual rank for ensemble construction for 89.8% (range 66.7%–100%) of all sizes and provided a 6.1% (range 1.3%–11.9%) skill improvement (Appendix Table 2). Size-4 ensembles chosen by ensemble rank performed similarly to the published hub ensemble, although performance often declined with additional models (Appendix Figures 2, 3). Relative forecast performance across ensemble strategies was consistent when stratified by ensemble size, forecast location, forecast date and epidemic phase, forecast target, and skill metric (Appendix Figures 4–18).
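A compact sketch of the two selection strategies follows; the scores and model names are illustrative only (lower score = better skill), not taken from the study. It also shows why the two strategies can disagree: the best-scoring ensemble may include a model with mediocre individual skill.

```python
# Sketch of the two model-selection strategies (illustrative scores and
# names; lower score = better skill).
individual_scores = {"model_a": 8.2, "model_b": 7.5, "model_c": 9.1, "model_d": 7.9}
ensemble_scores = {  # hypothetical training-period skill of every size-2 ensemble
    ("model_a", "model_b"): 7.0,
    ("model_a", "model_c"): 7.8,
    ("model_a", "model_d"): 7.4,
    ("model_b", "model_c"): 6.6,  # complementary errors make this pair best
    ("model_b", "model_d"): 7.2,
    ("model_c", "model_d"): 7.9,
}

n_D = 2

# Individual rank: choose the n_D models with the best individual skill.
individual_rank = sorted(individual_scores, key=individual_scores.get)[:n_D]

# Ensemble rank: choose the members of the best-scoring size-n_D ensemble.
ensemble_rank = min(ensemble_scores, key=ensemble_scores.get)

print("individual rank:", individual_rank)  # ['model_b', 'model_d']
print("ensemble rank:", ensemble_rank)      # ('model_b', 'model_c')
```

In this toy case, ensemble rank selects model_c despite its weak individual skill because it complements model_b, mirroring how ensemble rank can outperform individual rank.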
Our results provide guidance for future collaborative forecast efforts. Hub organizers should target a minimum of 4 validated forecast models to ensure robust performance compared with baseline models. Adding more models reduces the variability in expected ensemble performance but might come with diminishing returns in average forecast skill. Organizers should use past ensemble performance rather than individual performance when selecting models to include in forecast ensembles; it is likely that further gains and different relationships between ensemble size and performance will come from weighted ensemble approaches (8). As public health officials and researchers look to expand collaborative forecast efforts, and as funding agencies allocate budgets across methodological and applied forecast efforts, our results can be used to identify target participation rates, assemble appropriate forecast models, and further improve ensemble forecast performance.
Dr. Fox is an assistant professor in the Department of Epidemiology and Biostatistics and the Institute of Bioinformatics at the University of Georgia. His research interests include statistical modeling of emerging infectious diseases and outbreak forecasting.
Acknowledgments
We thank the Council for State and Territorial Epidemiologists, Centers for Disease Control and Prevention, the Models of Infectious Disease Agent Study forecasting working groups, and the Scenario Modeling Hub. We also thank the Texas Advanced Computing Center at The University of Texas at Austin for providing high performance computing resources.
Funding for S.J.F. and L.A.M. was provided by the Council for State and Territorial Epidemiologists (grant no. NU38OT000297) and the Centers for Disease Control and Prevention (grant no. 75D30122C14776). Funding for M.K., E.L.R., and N.G.R. was provided by the National Institute of General Medical Sciences (grant no. R35GM119582) and the Centers for Disease Control and Prevention (grant no. 1U01IP001122).
References
1. Reich NG, Lessler J, Funk S, Viboud C, Vespignani A, Tibshirani RJ, et al. Collaborative hubs: making the most of predictive epidemic modeling. Am J Public Health. 2022;112:839–42.
2. Biggerstaff M, Alper D, Dredze M, Fox S, Fung ICH, Hickmann KS, et al.; Influenza Forecasting Contest Working Group. Results from the Centers for Disease Control and Prevention's Predict the 2013–2014 Influenza Season Challenge. BMC Infect Dis. 2016;16:357.
3. Cramer EY, Ray EL, Lopez VK, Bracher J, Brennen A, Castro Rivadeneira AJ, et al. Evaluation of individual and ensemble probabilistic forecasts of COVID-19 mortality in the United States. Proc Natl Acad Sci U S A. 2022;119:e2113561119.
4. Lutz CS, Huynh MP, Schroeder M, Anyatonwu S, Dahlgren FS, Danyluk G, et al. Applying infectious disease forecasting to public health: a path forward using influenza forecasting examples. BMC Public Health. 2019;19:1659.
5. Reich NG, Brooks LC, Fox SJ, Kandula S, McGowan CJ, Moore E, et al. A collaborative multiyear, multimodel assessment of seasonal influenza forecasting in the United States. Proc Natl Acad Sci U S A. 2019;116:3146–54.
6. Cramer EY, Huang Y, Wang Y, Ray EL, Cornell M, Bracher J, et al.; US COVID-19 Forecast Hub Consortium. The United States COVID-19 Forecast Hub dataset. Sci Data. 2022;9:462.
7. FluSight forecast data, 2022–2023 [cited 2023 Jul 12]. https://github.com/cdcepi/Flusight-forecast-data
8. Ray EL, Brooks LC, Bien J, Biggerstaff M, Bosse NI, Bracher J, et al. Comparing trained and untrained probabilistic ensemble forecasts of COVID-19 cases and deaths in the United States. Int J Forecast. 2023;39:1366–83.
9. Bracher J, Ray EL, Gneiting T, Reich NG. Evaluating epidemic forecasts in an interval format. PLOS Comput Biol. 2021;17:e1008618.
10. Gneiting T, Raftery AE. Strictly proper scoring rules, prediction, and estimation. J Am Stat Assoc. 2007;102:359–78.
Original Publication Date: August 14, 2024
Address for correspondence: Spencer Fox, University of Georgia, 120 B.S. Miller Hall, Health Sciences Campus, 101 Buck Rd, Athens, GA 30602, USA