Addressing observational biases in data-driven approaches of zoonotic hazard prediction

Post provided by Andrea Tonelli

Over the past five decades, more than half of emerging infectious diseases in humans originated from animals, with zoonotic pathogens posing a growing threat to global health. Shifts in land use, climate change, direct use of wildlife and biodiversity loss all influence human exposure to pathogens of wild animals, shaping the likelihood of zoonotic spillover events. In the wake of COVID-19, understanding host-pathogen interactions and the mechanisms driving pathogen spillover has become one of the defining challenges of our time.

Given the uncertainty about the global distribution of animal pathogens, there is growing interest in developing innovative, data-driven approaches that could complement traditional surveillance methods for zoonotic spillover prevention and fill some of the crucial knowledge gaps. Machine learning and predictive modelling have gained traction for characterising the host range of pathogens (that is, the spectrum of different species that a pathogen can infect), in order to identify potential host species and pinpoint targets for zoonotic risk surveillance.

Figure 1 Scientists capturing flying foxes in Cameroon. Periodic surveillance activities are carried out in bat populations residing near human settlements where people might be exposed to viral spillover. Image credits: ©Jean-François Lagrot

These models learn from the biological and ecological characteristics of known hosts and, if not used carefully, are prone to replicating existing biases in host-pathogen associations. Indeed, research effort has historically focused on particular animal and pathogen taxa, leading to an incomplete and skewed understanding of pathogen distribution across host species. For example, the discovery of bat-associated viruses experienced an unprecedented upward trend following the emergence of SARS-CoV in 2002. Bats are known to host the greatest diversity of viruses among mammals; however, it is difficult to decouple the effect of research effort from that of the biological and ecological characteristics that may make bats more prone to hosting a higher diversity of viruses.

In our article, we took a step towards accounting for well-known biases that affect viral sampling in machine learning models for host prediction. First, we classified positive species into low-evidence and high-evidence hosts, where the latter are hosts for which observational studies suggest a potential ability of the species to maintain the pathogen in the environment. Our case study on betacoronaviruses included microbats, fruit bats, insectivores, and rodents as high-evidence hosts, which were handled in the analysis so that they contributed more to model predictions than low-evidence hosts. Additionally, data on host-pathogen associations don’t usually come with negatives (that is, species that cannot be infected by a given pathogen), so we introduced the concept of pseudo-negatives. We picked pseudo-negative species among those that were likely to have undergone virological sampling but have no documented associations with the target virus, leveraging patterns of taxonomic proximity and geographic overlap with sampled species. Finally, to estimate the expected accuracy of our framework, we tested our model on new positives (that is, true positive species that were sampled after the publication of the host-virus network that we analysed, and therefore did not enter model training). For this independent set of new positive species, the average predicted probability of being a host was more than 35% higher than that of still unknown hosts, which suggests that our modelling framework is indeed able to correctly identify currently unsampled hosts.
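To make these three ideas more concrete, here is a minimal Python sketch. It is not the code used in the article: the `Species` record, the numeric evidence weights, the use of a shared family plus a shared region as a stand-in for "taxonomic proximity and geographic overlap", and all species names are hypothetical choices made purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Species:
    name: str
    family: str          # crude proxy for taxonomic proximity (assumption)
    regions: frozenset   # coarse geographic units where the species occurs
    evidence: str        # "high", "low", or "none" (no documented association)


# Hypothetical weights: high-evidence hosts count more during model fitting.
EVIDENCE_WEIGHT = {"high": 1.0, "low": 0.5}


def pick_pseudo_negatives(species):
    """Species with no documented association that share a family and at
    least one region with a known host, and so were plausibly sampled."""
    known_hosts = [sp for sp in species if sp.evidence != "none"]
    return [
        sp for sp in species
        if sp.evidence == "none"
        and any(sp.family == h.family and sp.regions & h.regions
                for h in known_hosts)
    ]


def training_rows(species):
    """Build (name, label, weight) rows: weighted positives, then pseudo-negatives."""
    rows = [(sp.name, 1, EVIDENCE_WEIGHT[sp.evidence])
            for sp in species if sp.evidence in EVIDENCE_WEIGHT]
    rows += [(sp.name, 0, 1.0) for sp in pick_pseudo_negatives(species)]
    return rows


def relative_gain(new_positive_probs, unknown_probs):
    """Relative difference between the mean predicted probability of
    held-out new positives and that of still-unknown species."""
    mean_new = sum(new_positive_probs) / len(new_positive_probs)
    mean_unknown = sum(unknown_probs) / len(unknown_probs)
    return (mean_new - mean_unknown) / mean_unknown


# Toy data with made-up species, families, and regions.
toy = [
    Species("species_a", "family_1", frozenset({"region_x"}), "high"),
    Species("species_b", "family_2", frozenset({"region_y"}), "low"),
    Species("species_c", "family_1", frozenset({"region_x", "region_z"}), "none"),
    Species("species_d", "family_1", frozenset({"region_w"}), "none"),
    Species("species_e", "family_3", frozenset({"region_y"}), "none"),
]
rows = training_rows(toy)
```

In this toy set only `species_c` becomes a pseudo-negative, because it shares both a family and a region with a known host; `species_d` and `species_e` stay unlabelled. The resulting weights could be passed to any classifier that accepts per-sample weights, and `relative_gain` mirrors the kind of comparison between new positives and still unknown hosts described above.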

Figure 2 Graphical summary of the modelling framework described in our article. The framework has been developed to predict hosts of target pathogens while accounting for different degrees of evidence and uncertainty in the definition and prediction of host status.

Future steps

As analytical tools continue to evolve and higher-quality data becomes increasingly available, integrating machine learning into zoonotic risk assessment could benefit disease surveillance efforts by supporting more efficient and proactive responses to emerging infectious threats. With the basics of our framework in place, we are currently working on applying our methods to a wider variety of viruses with zoonotic potential that may pose a threat to public health. If you are passionate about the macroecology of infectious diseases and ways of anticipating zoonotic risk, or have thoughts about our article and potential future steps, I would love to hear from you! You can get in touch with me at ndrtonelli@gmail.com.

Read the full article here.
