Post provided by Matilda Brown, Barbara Holland, and Greg Jordan
There are many reasons that we might be interested in whether individuals, species or populations overlap in multidimensional space. In ecology and evolution, we might be interested in climatic, morphological, phenological or biochemical overlap. We can use analyses of overlap to study resource partitioning, evolutionary histories and palaeoenvironmental conditions, or to inform conservation management and taxonomy. Even these represent only a subset of the possible cases in which we might want to investigate overlap between entities. Databases such as GBIF, TRY and WorldClim make vast amounts of data publicly available for these investigations. However, these studies require complex multivariate data, and distilling such data into meaningful conclusions is no walk in the park.
Points in (Hyper)space
When studying problems relating to multidimensional overlap, often the only data we have are individual point records, each with measured variables. So we can think of multivariate observations as points in multidimensional space (i.e. hyperspace), where each variable is an axis or dimension. In geographic space, these dimensions are latitude and longitude, but in ‘biological space’ (Soberon and Peterson give an excellent explanation of this concept in Ecological Niches and Geographic Distributions) they might be climatic variables (e.g. mean annual temperature, mean annual precipitation), morphological variables (e.g. beak length, beak width) or functional traits. When we analyse overlap, we’re looking for regions of this space occupied by both entities.
For example, Microcachrys is currently endemic to the mountains of Tasmania, but occurs with tropical species in the fossil record. By studying climatic overlap (or non-overlap) with the living relatives of these fossils, we might be able to explain this anomaly.
The HYPEROVERLAP Approach
We can use a machine learning approach to test whether pairs of multivariate datasets occupy overlapping or non-overlapping areas of hyperspace. If we can find a boundary that separates the points (i.e. if we can draw a line between them or, in more than two dimensions, a surface), then they don’t overlap. We can train a support vector machine (SVM; a machine-learning classifier) to find this boundary. The SVM finds the optimal boundary between two sets of points and then checks if the original observations are on the ‘correct’ side of the line. If the SVM finds a line that separates the observations perfectly, then there’s no overlap. If it’s impossible to draw a line without points on the ‘wrong side’, then we’d infer overlap.
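To make the separation idea concrete, here is a minimal Python sketch. It is not the HYPEROVERLAP code: it swaps the SVM for a simple perceptron, which also finds a straight boundary whenever one exists but makes no attempt to find the optimal one. The function names and the toy coordinates are invented for illustration, and failing to find a boundary within max_epochs is treated as overlap, which is only a heuristic.

```python
def boundary_score(weights, bias, point):
    """Signed score: positive on one side of the boundary, negative on the other."""
    return sum(w * x for w, x in zip(weights, point)) + bias


def detect_overlap(points_a, points_b, max_epochs=1000):
    """Search for a straight boundary between two point sets.

    Perceptron stand-in for the SVM (an assumption of this sketch).
    Returns ("non-overlap" or "overlap", number of points on the wrong side).
    """
    labelled = [(p, 1) for p in points_a] + [(p, -1) for p in points_b]
    weights = [0.0] * len(points_a[0])
    bias = 0.0
    for _ in range(max_epochs):
        updates = 0
        for point, side in labelled:
            # A point on the wrong side (or exactly on the boundary) nudges the boundary.
            if side * boundary_score(weights, bias, point) <= 0:
                weights = [w + side * x for w, x in zip(weights, point)]
                bias += side
                updates += 1
        if updates == 0:  # every point is on its 'correct' side
            break
    wrong_side = sum(
        1 for point, side in labelled
        if side * boundary_score(weights, bias, point) <= 0
    )
    return ("non-overlap" if wrong_side == 0 else "overlap"), wrong_side


# Toy data: two well-separated clouds of points.
separate_a = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.2)]
separate_b = [(5.0, 5.0), (6.0, 5.5), (5.5, 6.5)]
separate_result = detect_overlap(separate_a, separate_b)

# Toy data: one point of the second set sits in the middle of the first set.
mixed_a = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (4.0, 4.0)]
mixed_b = [(2.0, 2.0), (6.0, 6.0)]
mixed_result = detect_overlap(mixed_a, mixed_b)

print(separate_result)
print(mixed_result[0])
```

In the second case no straight line can put (2, 2) on the opposite side from all four surrounding points, so at least one observation always ends up on the wrong side.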
The HYPEROVERLAP Package
We developed the R HYPEROVERLAP package to provide simple tools for identifying and inspecting overlap in an arbitrary number of dimensions. The basic function is hyperoverlap_detect. This function identifies the optimal boundary using support vector machines from the e1071 package. It then returns the result (overlap or non-overlap), the number of points on the wrong side of the boundary (this can also be used for targeted data-cleaning) and the model describing the decision boundary.
The shape of this boundary is set using the arguments kernel and kernel.degree; if kernel = "linear", the boundary must be straight (a line in two dimensions, a flat surface in more), but setting kernel = "polynomial" allows this boundary to be curved. In theory, we could use more complex kernels (you can learn more about kernel functions here), but they are likely to produce boundaries with no biological meaning (i.e. overfitting).
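The difference between a straight and a curved boundary can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than what the package does: a real polynomial kernel works implicitly inside the SVM, whereas here the powers of a single variable are written out explicitly and a perceptron looks for a straight boundary in that expanded space. The function names and values are hypothetical.

```python
def is_separable(features_a, features_b, max_epochs=1000):
    """Return True if a perceptron finds a straight boundary between the two sets."""
    labelled = [(f, 1) for f in features_a] + [(f, -1) for f in features_b]
    weights = [0.0] * len(features_a[0])
    bias = 0.0
    for _ in range(max_epochs):
        updates = 0
        for features, side in labelled:
            score = sum(w * x for w, x in zip(weights, features)) + bias
            if side * score <= 0:
                weights = [w + side * x for w, x in zip(weights, features)]
                bias += side
                updates += 1
        if updates == 0:
            return True
    return False  # no boundary found within max_epochs (treated as overlap)


def polynomial_features(x, degree):
    """Expand one measurement into its powers: x, x^2, ..., x^degree."""
    return [x ** d for d in range(1, degree + 1)]


# One variable; group B sits in the gap in the middle of group A.
group_a = [-3.0, -2.0, 2.0, 3.0]
group_b = [-0.5, 0.0, 0.5]

# Degree 1 mimics a linear kernel: a single threshold on x cannot split the groups.
linear_ok = is_separable(
    [polynomial_features(x, 1) for x in group_a],
    [polynomial_features(x, 1) for x in group_b],
)

# Degree 2 adds x^2, so a straight boundary in (x, x^2) is a curved one in x.
curved_ok = is_separable(
    [polynomial_features(x, 2) for x in group_a],
    [polynomial_features(x, 2) for x in group_b],
)

print(linear_ok, curved_ok)
```

The same data are reported as overlapping or non-overlapping depending on how flexible the boundary is allowed to be, which is why very flexible kernels risk overfitting.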
To run hyperoverlap, data should be in the same form as the iris dataset (included with R): variable names as column names, species/entity ID as a variable (i.e. not row names). If there are more than two entities in the data, you have two options:
- Filter the input so that it only contains the two entities of interest
- Use hyperoverlap_set instead, which runs hyperoverlap_detect pairwise on the set of entities
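Both options come down to splitting the records by entity and comparing two entities at a time. The Python sketch below shows that bookkeeping only; the record layout, the column names and the helper functions are assumptions made for illustration, not the package's interface.

```python
from itertools import combinations

# Hypothetical records: one row per observation, entity ID stored as a column.
records = [
    {"entity": "sp_a", "temp": 10.2, "precip": 800.0},
    {"entity": "sp_a", "temp": 11.0, "precip": 950.0},
    {"entity": "sp_b", "temp": 20.5, "precip": 300.0},
    {"entity": "sp_b", "temp": 22.1, "precip": 410.0},
    {"entity": "sp_c", "temp": 15.3, "precip": 600.0},
    {"entity": "sp_c", "temp": 16.8, "precip": 640.0},
]


def split_by_entity(rows, id_key="entity"):
    """Group the measured variables of each row under its entity ID."""
    groups = {}
    for row in rows:
        point = tuple(value for key, value in row.items() if key != id_key)
        groups.setdefault(row[id_key], []).append(point)
    return groups


def entity_pairs(rows, id_key="entity"):
    """Yield every unordered pair of entities together with their points."""
    groups = split_by_entity(rows, id_key)
    for first, second in combinations(sorted(groups), 2):
        yield first, second, groups[first], groups[second]


# Option 1: keep only the two entities of interest.
two_entities = [row for row in records if row["entity"] in {"sp_a", "sp_b"}]

# Option 2: walk through all pairs; each pair would go to an overlap test.
pair_names = [(first, second) for first, second, _, _ in entity_pairs(records)]
print(pair_names)
```

With n entities there are n(n-1)/2 pairs, so the number of comparisons grows quickly as entities are added.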
Observations with missing values in any dimension should be removed – because hyperoverlap places points in space, all variables are treated together. The values of each variable should be comparable. We suggest transforming each variable to an approximately normal distribution and standardising each dimension (e.g. using scale). Note: when transforming/scaling, we recommend using the global values for each variable, not the values for each entity or entity pair. Blonder gives a great explanation of why transformation is recommended in Hypervolume concepts in niche‐ and trait‐based ecology.
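As a rough Python sketch of this preparation step (with invented data and hypothetical helper names), incomplete observations are dropped first, then every variable is centred and scaled using the mean and standard deviation of all remaining observations rather than per entity. The sample standard deviation is used, on the assumption that this mirrors what scale does in R; any transformation towards normality would come before this step and is left out here.

```python
import math
from statistics import mean, stdev

# Hypothetical observations: (entity ID, (temperature, precipitation)).
observations = [
    ("sp_a", (10.0, 800.0)),
    ("sp_a", (12.0, 950.0)),
    ("sp_a", (11.0, None)),            # missing precipitation
    ("sp_b", (20.0, 300.0)),
    ("sp_b", (22.0, float("nan"))),    # missing precipitation
    ("sp_b", (24.0, 450.0)),
]


def is_missing(value):
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def drop_incomplete(rows):
    """Remove observations with a missing value in any dimension."""
    return [(label, values) for label, values in rows
            if not any(is_missing(v) for v in values)]


def standardise_globally(rows):
    """Centre and scale each variable using all rows, regardless of entity."""
    columns = list(zip(*(values for _, values in rows)))
    centres = [mean(column) for column in columns]
    spreads = [stdev(column) for column in columns]
    return [
        (label, tuple((v - c) / s for v, c, s in zip(values, centres, spreads)))
        for label, values in rows
    ]


complete = drop_incomplete(observations)
scaled = standardise_globally(complete)
scaled_columns = list(zip(*(values for _, values in scaled)))

for label, values in scaled:
    print(label, [round(v, 2) for v in values])
```

Scaling each entity separately would move every entity's points to the same centre and hide any real difference between them, which is why the global values are used.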
In two or three dimensions, we can easily visually inspect the SVM decision boundary. In higher dimensions, we can use linear discriminant analysis to visualise the data in a brain-friendly number of dimensions. This lets us eyeball the result or check for potentially erroneous points. The different visualisation functions are illustrated below – check out the package vignette for details (and for interactive 3D plots!).
For anyone wondering about the ‘residualPCA’ axes: for two-class data, linear discriminant analysis (LDA) returns a single axis which best separates the two classes and is also (in most cases) approximately perpendicular to the decision boundary. As a visualisation tool, we weren’t happy with a one-dimensional output, so the hyperoverlap_lda function runs a principal components analysis on the residual dimensions (hence the ‘residual’ part of the name). This allows us to plot the LDA result in two or three dimensions.
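The single LDA axis for two classes can be illustrated with a short Python sketch. It covers only the two-variable case, where the axis is the inverse of the pooled within-class scatter matrix applied to the difference between the class means; it does not reproduce hyperoverlap_lda or its residual PCA step, and the data and function names are invented.

```python
import math


def mean_vector(points):
    """Mean of a list of 2D points."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(2)]


def scatter_matrix(points, centre):
    """2x2 sum of outer products of deviations from the class mean."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - centre[0], p[1] - centre[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def lda_axis(points_a, points_b):
    """Unit vector along the two-class discriminant axis (2D only)."""
    mu_a, mu_b = mean_vector(points_a), mean_vector(points_b)
    s_a = scatter_matrix(points_a, mu_a)
    s_b = scatter_matrix(points_b, mu_b)
    s = [[s_a[i][j] + s_b[i][j] for j in range(2)] for i in range(2)]
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    inv = [[s[1][1] / det, -s[0][1] / det],
           [-s[1][0] / det, s[0][0] / det]]
    diff = [mu_a[0] - mu_b[0], mu_a[1] - mu_b[1]]
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    norm = math.hypot(w[0], w[1])
    return [w[0] / norm, w[1] / norm]


def project(points, axis):
    """Position of each point along the axis."""
    return [p[0] * axis[0] + p[1] * axis[1] for p in points]


class_a = [(1.0, 2.0), (2.0, 3.0), (3.0, 3.0), (2.0, 1.0)]
class_b = [(6.0, 5.0), (7.0, 8.0), (8.0, 7.0), (7.0, 6.0)]

axis = lda_axis(class_a, class_b)
scores_a = project(class_a, axis)
scores_b = project(class_b, axis)
print([round(s, 2) for s in scores_a])
print([round(s, 2) for s in scores_b])
```

With more than two variables, whatever variation is left after removing this axis is what a residual principal components analysis would summarise for plotting.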
Analysing Multiple Entities
The package also provides built-in functionality to run hyperoverlap_detect on a set of entities, using hyperoverlap_set and hyperoverlap_pairs_plot, as in the example below.
Where to Next for HYPEROVERLAP?
So far, we’ve focused on detecting overlap/non-overlap, but the concepts underpinning HYPEROVERLAP can also be extended to look at other qualitative relationships. In some cases, the SVM doesn’t find a boundary at all, suggesting there is no region of space uniquely occupied by one species or population (i.e. the space occupied by blue falls entirely within the space occupied by red). This might be useful in studies of biological changes, including phenological shifts and range expansion of invasive species. Although this conceptual extension has not been tested, it is a promising avenue for further research and potential inclusion in future versions of the HYPEROVERLAP R package.
To find out more about HYPEROVERLAP, check out our Methods in Ecology and Evolution article, ‘Hyperoverlap: detecting biological overlap in n-dimensional spaces’. The R package can be downloaded from GitHub here.
This article is the first to be published under our Transparent Peer Review Trial. You can read all of the reviews and decision letters for this paper on Publons.
Read the other Robert May Prize 2020 shortlisted articles here.