Post provided by Jonathan S. Lefcheck

Nature is complicated. As a scientist, you might say, “Well, duh,” but as students of nature, this complexity is probably the single greatest challenge we must face in trying to dissect the hows and whys of the natural world.

History is a Set of Lies Agreed Upon: Moving beyond ANOVA

For a long time, we tried to strip this complexity away by conducting very controlled experiments adhering to rigid designs. The ‘two-way fully-crossed analysis of variance’ will be familiar to anyone who has taken even the most basic stats class, because, for many decades, it was the gold standard for any experiment.

It might be tough to manipulate this whole reef.

The problem is: the real world doesn’t adhere to an ANOVA design. By this, I mean that manipulative experiments are, by their very nature, artificial. It’s hard, if not impossible, to manipulate an entire forest or coral reef, and so we retreat to more tractable, smaller investigations. There is certainly a lot of value in determining whether a phenomenon can occur, but these tightly regulated designs say nothing about whether it is likely to occur, particularly at the scales most relevant to humanity.

To get at the latter point, we must leave the safety of the greenhouse. However, our trusty ANOVA toolbox isn’t very useful anymore, because real-world data often violate the most basic statistical assumptions, not to mention the presence of numerous additional influences that may drive spurious relationships.

The advent of new statistical methods, such as the incorporation of non-normal errors and the modelling of non-independence or non-constant variance, solves many of the shortcomings of ANOVA. But these methods still have two major limitations.

First, everyone knows the old mantra “Correlation does not imply causation.” These methods still assume correlative relationships, even though we often talk about them like they are causative.


Second, these methods assume a direct relationship between the independent variables and the outcome: Y is caused by X. Such models are often not very mechanistic, because it’s rare that Y is directly caused by X. Think about the classic trophic cascade: predators do not actually interact with the primary producer. Their relationship is instead mediated by herbivores, which both control the resource and serve as prey for the predators.

It’s often not possible to talk about causation because it’s unclear whether X causes Y, Y causes X, or both are responding to some other factor (think: the herbivores).

You Must Unlearn What You Have Learned: Inferring Causation

A classic salt marsh trophic cascade, where crabs eat snails, and snails eat marsh grass. From Silliman & Bertness 2002 PNAS. Illustrated by Jane K. Neron.

In the early 2000s, Bill Shipley wrote a book called Cause and Correlation in Biology where he argued that, actually, you can infer causation if you can use existing knowledge of the system to restrict the alternatives.

Take this salt marsh trophic cascade, for example. We know from a simple examination of the organisms and their habits that primary producers are not eating herbivores or predators. It’s just not how the system works. We can arrive at a similar conclusion for herbivores: they are not eating predators, but rather the other way around. So we have eliminated the potential for the cascade to flow in the opposite direction (plants eating herbivores) and identified a logical mediator between predators and primary producers (the herbivores).

The problem is that, until recently, we didn’t have a way to quantitatively evaluate the cascading relationship. We could run a single multiple regression, or two simple regressions, but these are both inaccurate or inelegant ways to describe the phenomenon.

Structural Equation Modelling

Enter: structural equation modelling (or SEM for short) – a method that strings together variables and evaluates them in a single causal network. Let’s break this definition down.

First, ‘string together variables’ means that variables can act as both responses and predictors, so we can model cascading or indirect effects. It also means that multiple hypotheses – herbivores eat plants, predators eat herbivores, predators affect plants by removing herbivores – can be tested simultaneously.
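In R terms, ‘stringing together variables’ simply means the same variable can appear on the left-hand side of one formula and the right-hand side of another. Here is a minimal sketch on simulated data, with hypothetical variable names standing in for the salt marsh cascade:

```r
# Simulated salt-marsh data (hypothetical names): predators suppress
# herbivores, and herbivores suppress plants
set.seed(42)
n <- 100
predators  <- runif(n, 0, 10)
herbivores <- 5 - 0.4 * predators + rnorm(n)
plants     <- 12 - 0.6 * herbivores + rnorm(n)
marsh <- data.frame(predators, herbivores, plants)

# 'herbivores' is the response in the first model and a predictor in the
# second, so the effect of predators on plants flows through the herbivores
m1 <- lm(herbivores ~ predators, data = marsh)
m2 <- lm(plants ~ herbivores, data = marsh)
```

Fitting both pieces as part of one network is what lets the indirect path (predators → herbivores → plants) be quantified, rather than lumping everything into a single regression.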

You’re unlikely to see a plant eating a giraffe.

‘A single causal network’ implies that the relationships are causal, i.e. that they are directional. Herbivores eat plants; plants don’t eat herbivores. Along with the complete network of variables, which can include mediating influences or other constraints on the relationships of interest (such as soil nutrients), the imposition of directionality leads to inferences that can be deemed truly causal.

These two ideas break with traditional inferential statistics, but really open the door in terms of the types of insight that can be gained. For example, by implementing statistical rather than experimental controls, SEM can be used to evaluate all types of data, from simple bucket experiments to global observational surveys.

The Way of Progress is Swift and Easy: Making SEM Accessible to Everyone

Early iterations of SEM required multivariate normality and independence, like many other statistical tests. New extensions introduced over the last decade – referred to as generalised confirmatory path analysis or piecewise SEM – relaxed these assumptions, so it’s now possible to fit models to a variety of non-normal distributions and model non-independence through specification of fixed correlation structures or even random effects.
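Concretely, ‘piecewise’ means each path in the network is its own model, so each response can take whatever error distribution or grouping structure it needs. A sketch of what that flexibility looks like, assuming the lme4 package and hypothetical variable names:

```r
library(lme4)  # mixed-effects component models

# Simulated data: herbivore counts and plant biomass in plots nested in sites
set.seed(1)
marsh <- data.frame(
  site             = factor(rep(1:10, each = 10)),
  predator_density = runif(100, 0, 5)
)
marsh$herbivore_count <- rpois(100, exp(1.5 - 0.2 * marsh$predator_density))
marsh$plant_biomass   <- 10 - 0.5 * marsh$herbivore_count +
  rnorm(10)[marsh$site] + rnorm(100)

# Each component model gets the structure its response requires:
# counts -> Poisson GLM; continuous biomass with site grouping -> mixed model
m1 <- glm(herbivore_count ~ predator_density, family = poisson, data = marsh)
m2 <- lmer(plant_biomass ~ herbivore_count + (1 | site), data = marsh)
```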

The problem arises in fitting these models, which is cumbersome to do by hand, particularly as models increase in complexity. Existing software to automate SEM doesn’t allow for the increased flexibility of these new methods, and has specialised and sometimes esoteric syntax.

R has emerged as the preeminent statistical software in ecology and evolution, and for good reason: it’s free, adaptable, and there is a tremendous support base. Through packages, the base functionality of R can be expanded to fit a variety of model forms.

In my paper, ‘piecewiseSEM: Piecewise structural equation modelling in R for ecology, evolution, and systematics’, I introduce a package (piecewiseSEM) that allows simple fitting and evaluation of structural equation models using R packages and syntax familiar to many ecologists and evolutionary biologists, such as nlme and lme4.

It’s as simple as defining the causal network, breaking it into its component models, and coding those models in R. The package then takes the list of individual models and conducts goodness-of-fit tests, reports (standardised) coefficients, and yields other important information, such as R2 values, partial correlations, and predictions. The package is also fully documented with help files and example datasets.
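For illustration, here is a minimal workflow on simulated data, assuming the version 1.x interface (`sem.fit` and `sem.coefs`; the 2.0 rewrite switches to `summary`-style functions). The data and variable names are hypothetical:

```r
library(piecewiseSEM)

# Simulated cascade data (hypothetical names)
set.seed(42)
n <- 100
predators  <- runif(n, 0, 10)
herbivores <- 5 - 0.4 * predators + rnorm(n)
plants     <- 12 - 0.6 * herbivores + rnorm(n)
marsh <- data.frame(predators, herbivores, plants)

# 1. Break the causal network into its component models
modelList <- list(
  lm(herbivores ~ predators, data = marsh),
  lm(plants ~ herbivores, data = marsh)
)

# 2. Goodness-of-fit via tests of directed separation
sem.fit(modelList, marsh)

# 3. (Standardised) path coefficients
sem.coefs(modelList, marsh, standardize = "scale")
```

If the d-separation tests flag a missing path (say, a residual predator–plant link), it can be added to the model list and the network re-evaluated.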

I’m actively developing new methods to further extend the utility of piecewise SEM and the piecewiseSEM package, including the generalisation of indirect effects to non-linear variables and the inclusion of both latent and composite variables. The package is currently being rewritten for version 2.0 to bring it even closer to parity with existing R functions, including the use of summary and update. You can check out continued development on GitHub, but there is certainly much more in store for piecewise SEM…

To find out more about piecewiseSEM, read our Methods in Ecology and Evolution article ‘piecewiseSEM: Piecewise structural equation modelling in R for ecology, evolution, and systematics’.

This paper was highly commended in the 2016 Robert May Early Career Research Award. It will be freely available in the BES Early Career Researcher Awards Virtual Issue for a limited time.