Post provided by Rob J. Boyd
Colleagues and I recently published a paper in MEE, and its title might induce a bit of head scratching: “Using causal diagrams … to correct geographic sampling biases in biodiversity monitoring data” (Boyd et al., 2025). If you’re familiar with causal inference, you might be wondering, “What have causal diagrams got to do with sampling biases?” And if you’re new to the concept, the title probably doesn’t make much sense at all. Let me explain…
Most people understand the word “bias” to imply unfairness, prejudice or favouritism. As a fan of Newcastle United Football Club, I find myself accusing Premier League referees of bias whenever things don’t go our way. But there are, of course, more serious (and genuine) cases of bias—say, when Large Language Models amplify the views of WEIRD (Western, Educated, Industrialized, Rich, and Democratic) humans, since those humans contributed disproportionately to the published literature on which the models were trained (Abdurahman et al., 2024).
In survey sampling, there exists a special kind of bias called sampling bias. The precise definition of sampling bias has been quibbled over by statisticians, but to me it is most satisfyingly expressed in terms of a correlation. Imagine that we are interested in a species’ abundance across a set of sites and that only a fraction of those sites has been sampled. The correlation between the probability of being included in the sample and the species’ abundance across sites neatly captures the concept of favouritism in that it indicates whether certain values of abundance are likely to have been preferentially sampled. Indeed, an analogous definition of sampling bias, based on the correlation between whether people respond to opinion polls and how they respond, is starting to gain traction in political science (Bailey, 2023).
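To make the correlation definition concrete, here is a minimal simulation sketch (entirely hypothetical numbers, not from the paper): sites where the species is more abundant are more likely to be sampled, and the resulting sample mean overstates the true mean abundance.

```python
import numpy as np

rng = np.random.default_rng(42)

# A hypothetical landscape of 10,000 sites with varying species abundance.
n_sites = 10_000
abundance = rng.poisson(lam=20, size=n_sites).astype(float)

# Sampling bias: the probability of a site being included in the sample
# rises with abundance (e.g. recorders favour sites where the species
# is easy to find), plus some unrelated noise.
inclusion_prob = np.clip(
    0.05 + 0.02 * abundance + rng.normal(0, 0.05, n_sites), 0.01, 1.0
)
sampled = rng.random(n_sites) < inclusion_prob

# The sampling bias, expressed as a correlation across sites.
bias = np.corrcoef(inclusion_prob, abundance)[0, 1]

print(f"correlation (inclusion prob., abundance): {bias:.2f}")
print(f"true mean abundance:   {abundance.mean():.2f}")
print(f"sample mean abundance: {abundance[sampled].mean():.2f}")
```

Because high-abundance sites are over-represented, the sample mean sits above the true mean; with no correlation, the two would agree up to sampling error.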
A major advantage of defining sampling bias as a correlation is that it becomes clear how to mitigate its effects. It is quite well known that the correlation between two variables can be reduced by holding certain other variables constant (formally known as “conditioning on” them). In the example given above, there is a sampling bias if the probability of sites being included in the sample is correlated with the species’ abundance. But we might find that by focusing on sites within protected areas, the correlation disappears. In this case, we have conditioned on one level of protected area status (inside) and in doing so eliminated the sampling bias. The question is how to identify the variables that exhibit this bias-mitigating property once conditioned on (again, think held constant).
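The protected-area example above can be sketched in code. This is a toy simulation under assumed numbers (the species is commoner inside protected areas, and protected areas are more heavily sampled): conditioning on protection status—estimating the mean within each level and recombining with known proportions, i.e. post-stratification—removes the bias that the naive sample mean suffers.

```python
import numpy as np

rng = np.random.default_rng(1)
n_sites = 20_000

# Hypothetical driver: protected-area status affects both abundance and
# the chance a site gets sampled (recorders favour protected areas).
protected = rng.random(n_sites) < 0.3
abundance = rng.poisson(lam=np.where(protected, 30, 10)).astype(float)
inclusion_prob = np.where(protected, 0.6, 0.1)
sampled = rng.random(n_sites) < inclusion_prob

naive = abundance[sampled].mean()

# Condition on protected-area status: estimate the mean within each
# level, then recombine using the known proportion of sites per level.
strata = [(protected, protected.mean()), (~protected, (~protected).mean())]
post_stratified = sum(abundance[sampled & s].mean() * w for s, w in strata)

print(f"true mean:            {abundance.mean():.2f}")   # ≈ 16
print(f"naive sample mean:    {naive:.2f}")              # biased (≈ 24)
print(f"post-stratified mean: {post_stratified:.2f}")    # ≈ 16
```

Within each level of protection, inclusion no longer depends on abundance, so the stratum means are unbiased and their weighted combination recovers the truth.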
An answer to this question can be found in a subdiscipline of statistics known as causal inference, where analysts face a similar challenge. To isolate the causal effect of one variable on another, variables that induce a spurious correlation between the two must be identified and conditioned on (last time: held constant). The way that causal inference folk tackle this problem is by constructing “causal diagrams” depicting their assumptions about the causal links between variables in the system. Variables are represented by “nodes”, and causal effects are denoted using arrows. The arrangement of the arrows and nodes can be translated into formal statements about dependencies among variables in the causal diagram and which variables need to be conditioned on to break them.
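These “formal statements” can be checked mechanically. The sketch below (a simplified implementation, not the paper’s method; node names are illustrative) tests d-separation in a small causal diagram via the standard moralisation recipe: restrict the diagram to the ancestors of the query nodes, marry co-parents, drop arrow directions, delete the conditioning set, and ask whether the two nodes are still connected.

```python
from itertools import combinations

def d_separated(dag, xs, ys, zs):
    """True if node sets xs and ys are d-separated given zs in the DAG
    `dag`, given as a dict mapping each node to the set of its children."""
    # 1. Keep only the ancestors of the nodes involved in the query.
    relevant, frontier = set(), set(xs) | set(ys) | set(zs)
    while frontier:
        node = frontier.pop()
        relevant.add(node)
        parents = {p for p, kids in dag.items() if node in kids}
        frontier |= parents - relevant

    # 2. Moralise: make edges undirected and marry co-parents.
    undirected = {n: set() for n in relevant}
    for parent, kids in dag.items():
        if parent not in relevant:
            continue
        for kid in kids & relevant:
            undirected[parent].add(kid)
            undirected[kid].add(parent)
    for node in relevant:
        parents = [p for p in relevant if node in dag.get(p, set())]
        for a, b in combinations(parents, 2):
            undirected[a].add(b)
            undirected[b].add(a)

    # 3. Delete the conditioning set; d-separated iff no path remains.
    reachable, stack = set(), [x for x in xs if x not in zs]
    while stack:
        node = stack.pop()
        if node in reachable:
            continue
        reachable.add(node)
        stack.extend(undirected[node] - set(zs))
    return reachable.isdisjoint(ys)

# Toy diagram: protection drives both a site's abundance and its
# chance of being sampled.
dag = {"protection": {"abundance", "sampled"}}
print(d_separated(dag, {"sampled"}, {"abundance"}, set()))           # False
print(d_separated(dag, {"sampled"}, {"abundance"}, {"protection"}))  # True
```

In the toy diagram, sample inclusion and abundance are dependent (both are effects of protection), but conditioning on protection d-separates them—exactly the bias-removing conditioning set we want.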
Now for the clever part. If we include the probability of being included in the sample in our causal diagram, then we can work out which variables must be conditioned on to render it independent of—and therefore uncorrelated with—our variable of interest. That is, we can use the diagram to identify the variables that, once conditioned on, eliminate the sampling bias!
Naturally, there is a catch. Constructing realistic causal diagrams is hard. It requires knowledge of the causes and effects of sample inclusion and of the variable being studied (abundance in the example above). Even indirect causes and effects that are mediated through other variables might turn out to be important. In my view and that of my co-authors, the best way to construct realistic causal diagrams is to seek input from local taxon and dataset experts. The optimal way to engage with the experts and to solicit and assemble their feedback is probably context-specific.

Check out the paper!
If you have found this blog post interesting, then please do check out our paper in MEE (Boyd et al., 2025). The paper essentially does two things. The first is to formalise the concepts I have introduced here. The second is to demonstrate how these concepts can be translated into an operational method, using data on the abundances of two species of butterfly from the UK Butterfly Monitoring Scheme as an example. See the figure below for a conceptual overview.
Abdurahman, S., Atari, M., Karimi-Malekabadi, F., Xue, M. J., Trager, J., Park, P. S., Golazizian, P., Omrani, A., & Dehghani, M. (2024). Perils and opportunities in using large language models in psychological research. PNAS Nexus, 3(7). https://doi.org/10.1093/pnasnexus/pgae245
Bailey, M. A. (2023). A New Paradigm for Polling. Harvard Data Science Review, 5(3). https://doi.org/10.1162/99608f92.9898eede
Boyd, R. J., Botham, M., Dennis, E., Fox, R., Harrower, C., Middlebrook, I., Roy, D. B., & Pescott, O. L. (2025). Using causal diagrams and superpopulation models to correct geographic biases in biodiversity monitoring data. Methods in Ecology and Evolution. https://doi.org/10.1111/2041-210X.14492