Robert May Prize Shortlisted Article

Post provided by Raphaëlle Momal

Each year Methods in Ecology and Evolution awards the Robert May Prize to the best paper in the journal by an author at the start of their career. Raphaëlle Momal has been shortlisted for her article ‘Tree‐based inference of species interaction networks from abundance data’. In this blog, Raphaëlle discusses how her paper came to be and the applications of the R package developed in her study.

My co-authors and I come from a mathematical background, and are passionate about creating useful and innovative tools for the study of biodiversity. Species association networks are perfect for that. They are really interesting objects, both prominent in ecology and linked to an active mathematical research area. So our first objective for this work was to clearly and precisely transfer the mathematical notions to the ecological world. We quickly discovered that this is easier said than done, as the two disciplines share a lot of vocabulary but with very different meanings! So we actually first had to learn a lot from the ecological literature, which was a very formative and enriching experience.

Species association networks cover many different types of tasks in the literature. To set a clear basis for the future, we distinguish our work from what we call network reconstruction, which is the prediction of links in the network from observed species interactions. We tackle the task of network inference instead, which aims to build the whole network from observed species census data only. Network inference is about identifying the statistical links (associations) between the species of an ecosystem. This might seem frustrating at first, as we do not have access to the true nature of the ecological interaction between the species (parasitism, trophic relationship, mutualism, etc.). However, this formalism allows for a general view of species interactions which can be transposed to other fields, and helps unravel interactions that are hard to observe.

Conditional dependence

Between two species, conditional dependence can be understood as the statistical dependence that remains after controlling for the effect of all other observed species. Conditional dependence thus represents only direct statistical links between species, which is a big advantage for the sparsity and interpretability of the network. Such a network can answer the question “which species would be directly affected by a variation in the population of this particular species?”. In studying conditional dependencies, taking environmental effects into account is paramount, as they could otherwise create spurious links in the network. Our approach infers the species conditional dependence network, and accounts for both available covariates and experimental offsets. You can download the R package we developed, called EMtree.
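To make the idea concrete, here is a minimal sketch in Python (not taken from the EMtree package). It assumes a Gaussian setting, where conditional dependence can be read from the inverse covariance (precision) matrix. The three-species chain A - B - C and its numerical values are made up for illustration.

```python
import numpy as np

# Hypothetical precision matrix for three species in a chain A - B - C:
# A and C are only linked through B, so their precision entry is zero.
precision = np.array([
    [1.0, -0.5, 0.0],
    [-0.5, 1.5, -0.5],
    [0.0, -0.5, 1.0],
])


def to_correlation(cov):
    """Rescale a covariance matrix into a correlation matrix."""
    d = np.sqrt(np.diag(cov))
    return cov / np.outer(d, d)


def partial_correlations(omega):
    """Partial correlations: -omega_ij / sqrt(omega_ii * omega_jj)."""
    d = np.sqrt(np.diag(omega))
    pc = -omega / np.outer(d, d)
    np.fill_diagonal(pc, 1.0)
    return pc


marginal = to_correlation(np.linalg.inv(precision))
partial = partial_correlations(precision)

# A and C are marginally correlated (indirect link through B) ...
print("marginal corr(A, C):", marginal[0, 2])
# ... but conditionally independent once B is controlled for.
print("partial corr(A, C): ", abs(partial[0, 2]))
```

In this toy example the marginal correlation between A and C is non-zero, while their partial correlation vanishes: only the two direct links A - B and B - C would appear in the conditional dependence network.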

EMtree package

EMtree stands for “Expectation-Maximization algorithm for network inference using trees”. This package takes advantage of the Matrix-Tree Theorem to provide an exhaustive and efficient exploration of the space of spanning-tree structures (graphs with no cycles that connect all nodes). This theorem is a powerful algebraic tool for spanning trees, which allows summing over their whole space at the cost of a determinant computation. For ten species, this means summing a hundred million terms (there are 10^8 spanning trees on ten nodes) in roughly a thousand operations.
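The counting argument can be checked with a short Python sketch (an illustration only, not the package's code). It uses the unweighted version of the Matrix-Tree Theorem: the number of spanning trees of a graph equals the determinant of its Laplacian with one row and the matching column removed. The function names are chosen for this example.

```python
import numpy as np
from itertools import combinations


def count_spanning_trees(adjacency):
    """Matrix-Tree Theorem: determinant of the Laplacian minus one row/column."""
    a = np.asarray(adjacency, dtype=float)
    laplacian = np.diag(a.sum(axis=1)) - a
    return np.linalg.det(laplacian[1:, 1:])


def complete_graph(n):
    """Adjacency matrix of the complete graph on n nodes."""
    return np.ones((n, n)) - np.eye(n)


def brute_force_count(n):
    """Enumerate all sets of n-1 edges and keep the acyclic ones (small n only)."""
    edges = list(combinations(range(n), 2))
    count = 0
    for subset in combinations(edges, n - 1):
        parent = list(range(n))

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x

        acyclic = True
        for u, v in subset:
            ru, rv = find(u), find(v)
            if ru == rv:  # this edge would close a cycle
                acyclic = False
                break
            parent[ru] = rv
        count += acyclic
    return count


# One 9 x 9 determinant replaces an enumeration over all spanning trees.
trees = int(round(count_spanning_trees(complete_graph(10))))
print(trees)                 # number of spanning trees on ten nodes
print(brute_force_count(5))  # explicit enumeration, feasible only for tiny graphs
```

For the complete graph this agrees with Cayley's formula, n^(n-2). The brute-force version is only there to confirm the determinant on a small case.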

The rationale for our approach is to consider a mixture of multivariate Gaussian variables with tree-shaped dependency structures. To accommodate count data, this mixture is included as the latent layer of a Poisson log-normal distribution, which also accounts for covariates and offsets. This distribution has received a lot of attention recently, as a variational estimation algorithm is now available. The Expectation-Maximization (EM) algorithm maximizes the likelihood of a model in the presence of latent variables. Thanks to the modelling properties of trees, we were able to derive an exact EM algorithm and compute edge probabilities. To foster robustness of the network, a resampling procedure is added, which can be parallelised and produces edge selection frequencies.
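As an illustration of how edge probabilities can come out of a distribution over trees, here is a hedged Python sketch. It is not the EMtree implementation. It assumes each spanning tree has a probability proportional to the product of its edge weights, and uses the classical result that the probability of an edge is then its weight times an "effective resistance" obtained from the inverse of the reduced Laplacian. The weight matrices are made up.

```python
import numpy as np


def edge_probabilities(weights):
    """P(edge in T) when P(T) is proportional to the product of its edge weights.

    `weights` must be symmetric, non-negative, with a zero diagonal.
    """
    w = np.asarray(weights, dtype=float)
    n = w.shape[0]
    laplacian = np.diag(w.sum(axis=1)) - w
    # Invert the Laplacian with node 0 removed, then pad with zeros for node 0.
    m = np.zeros((n, n))
    m[1:, 1:] = np.linalg.inv(laplacian[1:, 1:])
    diag = np.diag(m)
    resistance = diag[:, None] + diag[None, :] - 2 * m
    return w * resistance


# Uniform weights on four nodes: every edge is equally likely.
uniform = np.ones((4, 4)) - np.eye(4)
probs_uniform = edge_probabilities(uniform)
print(probs_uniform[0, 1])

# Arbitrary symmetric weights on five nodes (hypothetical values).
rng = np.random.default_rng(0)
upper = np.triu(rng.uniform(0.1, 1.0, size=(5, 5)), k=1)
random_weights = upper + upper.T
probs_random = edge_probabilities(random_weights)

# Every spanning tree has n - 1 edges, so the probabilities sum to n - 1.
print(probs_random[np.triu_indices(5, k=1)].sum())
```

All spanning trees are accounted for through a single matrix inversion, which is what makes an exact computation over the whole tree space affordable.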

Compared to recent alternative approaches on synthetic data, EMtree shows the best compromise between limiting the number of false edges and preserving the original network edge density. We also illustrate the dramatic effects of including covariates in network inference, using empirical ecological survey data.

Improving network inference

We are always looking to further improve the algorithm, and major upgrades have been made since publication. First, EMtree can now handle big datasets (more than 200 variables) with perfect numerical stability, and the running time is reduced. It now takes about a minute to infer a network with 600 nodes, and 5 minutes to complete 30 resamples.

As for the quality of the inference, we adapted the concept of stability selection to provide an adaptive threshold and obtain a network from the edge probabilities. We find that this approach gives a great compromise between false positives and false negatives across all network sizes. These improvements allow the use of EMtree for the study of microbial interactions, and of the human gut microbiota in particular.
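The general idea of turning resampled edge probabilities into a network can be sketched in Python as follows. This is an assumption-laden illustration, not the package's procedure: the adaptive threshold is not described in this post, so two fixed, hypothetical cut-offs (`prob_threshold` and `freq_threshold`) stand in for it, and the probabilities are invented.

```python
import numpy as np


def selection_frequencies(prob_stack, prob_threshold):
    """Fraction of resamples in which each edge probability exceeds a threshold."""
    stack = np.asarray(prob_stack, dtype=float)
    return (stack > prob_threshold).mean(axis=0)


def select_edges(frequencies, freq_threshold):
    """Keep the edges (i < j) whose selection frequency reaches the cut-off."""
    n = frequencies.shape[0]
    rows, cols = np.triu_indices(n, k=1)
    keep = frequencies[rows, cols] >= freq_threshold
    return [(int(i), int(j)) for i, j in zip(rows[keep], cols[keep])]


def symmetric(p01, p02, p12):
    """Build a 3 x 3 symmetric matrix of edge probabilities."""
    return np.array([[0, p01, p02], [p01, 0, p12], [p02, p12, 0]], dtype=float)


# Made-up edge probabilities from four resamples on three species.
stack = np.stack([
    symmetric(0.90, 0.10, 0.70),
    symmetric(0.80, 0.60, 0.30),
    symmetric(0.95, 0.20, 0.80),
    symmetric(0.70, 0.05, 0.90),
])

freq = selection_frequencies(stack, prob_threshold=0.5)
network = select_edges(freq, freq_threshold=0.75)
print(network)
```

Here the edge between species 0 and 2 is selected in only one resample out of four and is dropped, while the two other edges are kept. Raising or lowering the cut-offs trades false positives against false negatives.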

This work has also been extended to the interesting case of a missing species in the network. Using a variational EM this time, we are able to get some information about the missing species. Check out our preprint here for more details.

Find out more about the articles that were shortlisted for this year’s Robert May Prize.