Post provided by Renato Lima

Many biodiversity studies, covering a wide range of goals, need species records. These records are becoming readily available online, but they are still poorly standardised, so final users must spend a significant amount of time formatting records before using the data. To overcome this, Renato Lima et al. have created plantR, an open-source package that provides a comprehensive toolbox to manage species records from biological collections. In this blog post, Renato discusses the workflow of the package and describes how it can help researchers better assess data quality and avoid data leakage.

In late 2018, I found myself in need of species record information for a project on the endemism and conservation status of the Atlantic Forest tree flora (in collaboration with Hans ter Steege). My first idea was simple: download data from online repositories (e.g., Global Biodiversity Information Facility – GBIF) and do the analyses. Right? Not exactly.

Data repositories such as GBIF make available invaluable species information from thousands of collections across the globe, but most of the species records are not ready to use. There are vast differences in the way information is provided, much important information is missing (e.g., geographical coordinates), and it is often hard to know how reliable the available information really is (e.g., species identifications). Removing every record with a possible problem leads to data leakage; using all data irrespective of their quality can bias the study outcomes.

Amazed by the number of records that would not be usable in my specific study (about 80% of all records), I decided to clean up the data myself. I had no idea how much effort and time this decision would take but, thankfully, we don’t work alone. In early 2019, I met with Marinez de Siqueira, Andrea Sánchez-Tapia and Sara Mortara, and we quickly realized that we were doing similar things. We decided to collaborate on creating procedures and tools to manage species records. The idea kept growing and resulted in a new R package called ‘plantR’, described in a paper recently published in Methods in Ecology and Evolution.

The package

plantR was designed to help data providers, managers and final users to standardise and validate species records. At first, it largely reflected our professional backgrounds (i.e., plant ecologists and conservationists), but today the package provides tools that can be used by taxonomists and collection managers as well. It can be used by those curating collections, conducting taxonomic reviews, or carrying out many sorts of ecological and conservation studies, such as species distribution modelling, conservation assessments and prioritization of biodiversity conservation.

Thelypteris noveboracensis voucher harvested in Monongalia County, West Virginia. Credit: West Virginia University.

Some of the package functionalities are still focused on plant species, but if the species records follow the Darwin Core standards, many of the plantR functions will be useful for any group of organisms and any type of information (e.g., museum specimens, human observations & photos).

The package deals with different types of information associated with species records, such as collection codes, people and locality names, geographical coordinates, and species identifications. Moreover, it provides tools for retrieving duplicates across collections, including the homogenization of the information within groups of duplicates, which is handy for exchanging information updates among collections. It also provides tools for uploading, summarizing and exporting species records, and for generating species lists. plantR brings many novel features to manage species records, but its main strength lies in performing all steps, from data access to export, in one environment.

The approach

The data validation process of plantR relies on carefully curated maps and dictionaries provided with the package, such as gazetteers, lists of taxonomist names, and plant collections. The curation of these accessory files is key for assessing data quality. But it’s also laborious, particularly for the package gazetteer and locality variants. Since time and funds are always limited, we started with the Neotropics, a megadiverse region on which we focus most of our research.

It is important to note that plantR does not edit the original information of species records. Instead, it stores the standardized information separately so that collection managers and curators can compare the original and edited information. This is an important, applied goal of the package: to provide easy-to-use tools and tutorials so that the information associated with species records can be improved at its source, the biological collections. If possible, we also want to save collection managers and curators time in the important but difficult task of maintaining their collections, no matter how big they are.
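The idea of keeping the original and the standardized information side by side can be illustrated with a minimal Python sketch. This is not plantR's code (the package is written in R); the field names `country` and `country_std`, the function `standardise_country` and the small lookup table are all hypothetical and only serve to show the principle.

```python
def standardise_country(records, lookup):
    """Return copies of the records with an extra, standardised country field.

    The original "country" value is left untouched so that both versions
    can be compared later. Field names here are hypothetical.
    """
    out = []
    for rec in records:
        new = dict(rec)  # copy, so the input record is never modified
        raw = (rec.get("country") or "").strip().lower()
        new["country_std"] = lookup.get(raw)  # None when no match is found
        out.append(new)
    return out


# Toy records and a toy dictionary of name variants (illustrative only)
records = [{"country": " Brasil "}, {"country": "BRAZIL"}, {"country": "unknown land"}]
lookup = {"brasil": "brazil", "brazil": "brazil"}

result = standardise_country(records, lookup)
```

Because the standardized value lives in its own field, a curator can list every record where the two fields differ and decide whether the source record should be updated.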

The workflow

The application is accompanied by a workflow to process information from species records. But most tools can be used independently of the workflow as well, according to the user’s needs. The main steps of the workflow are the following.

Step 1 – Data entry: Users can enter species records in three different ways: (i) by loading Darwin Core Archive zip files downloaded from the GBIF online interface; (ii) by downloading records from GBIF and CRIA directly from R; or (iii) by loading their own datasets.

Step 2 – Data standardisation: Editing and standardising the fields associated with species records is important to prepare records for validation. The package provides tools to standardise: (i) plant collection codes, (ii) collector and identifier names, collector number and collection year, (iii) locality information (e.g., country names), (iv) geographical coordinates, and (v) taxonomic information (i.e. name notation and synonyms).

Step 3 – Data validation: The package validates (i) locality information and (ii) geographical coordinates. It also flags records that are possibly (iii) spatial outliers or (iv) cultivated specimens. Moreover, plantR classifies (v) the confidence level of species identifications. Finally, the package performs (vi) the search for duplicates across collections and (vii) the homogenization of information within duplicates, allowing the use of the best information available across collections.
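To make the duplicate search and homogenization in Step 3 more concrete, here is a minimal Python sketch of the general idea. It is an illustration under stated assumptions, not plantR's implementation: the duplicate key (collector, collector number, year), the confidence ranking, the field names and the toy records are all hypothetical.

```python
from collections import defaultdict

# Hypothetical ranking of identification confidence (higher is better)
CONFIDENCE_RANK = {"low": 0, "medium": 1, "high": 2}


def duplicate_key(rec):
    """Build a key from collector, collector number and year (assumed fields)."""
    return (rec["collector"].strip().lower(), str(rec["number"]).strip(), rec["year"])


def homogenise(records):
    """Group records sharing a duplicate key and share the best identification."""
    groups = defaultdict(list)
    for rec in records:
        groups[duplicate_key(rec)].append(rec)

    out = []
    for members in groups.values():
        # Pick the record whose identification has the highest confidence
        best = max(members, key=lambda r: CONFIDENCE_RANK[r["confidence"]])
        for rec in members:
            new = dict(rec)  # keep the original fields untouched
            new["species_homogenised"] = best["species"]
            new["n_duplicates"] = len(members)
            out.append(new)
    return out


# Toy records from three made-up collections (A, B, C)
records = [
    {"collection": "A", "collector": "Lima", "number": 101, "year": 2010,
     "species": "Myrcia sp.", "confidence": "low"},
    {"collection": "B", "collector": "lima ", "number": "101", "year": 2010,
     "species": "Myrcia splendens", "confidence": "high"},
    {"collection": "C", "collector": "Silva", "number": 7, "year": 1998,
     "species": "Ocotea odorifera", "confidence": "medium"},
]

homogenised = homogenise(records)
```

In this toy example the records from collections A and B are treated as duplicates of the same specimen, so the record in A gains the better identification held by B while its original `species` value stays available for comparison.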

Step 4 – Data summary and export: The package summarises (i) the data itself (e.g., number of records, collections and species) and (ii) the results of the data validation process. It is also possible to (iii) construct species lists with voucher specimens and (iv) export/save records by groups (e.g. families, countries, collections).
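The kind of summary and grouped export described in Step 4 can be sketched in a few lines of Python. Again, this is only an illustration of the idea and not plantR's interface: the functions `summarise` and `split_by`, the field names and the toy records are hypothetical.

```python
from collections import Counter


def summarise(records):
    """Count records, collections and species, plus records per family."""
    return {
        "n_records": len(records),
        "n_collections": len({r["collection"] for r in records}),
        "n_species": len({r["species"] for r in records}),
        "records_per_family": dict(Counter(r["family"] for r in records)),
    }


def split_by(records, field):
    """Group records by one field, e.g. to save one file per family."""
    groups = {}
    for rec in records:
        groups.setdefault(rec[field], []).append(rec)
    return groups


# Toy records (illustrative only)
records = [
    {"collection": "A", "family": "Myrtaceae", "species": "Myrcia splendens"},
    {"collection": "B", "family": "Myrtaceae", "species": "Myrcia splendens"},
    {"collection": "B", "family": "Lauraceae", "species": "Ocotea odorifera"},
]

summary = summarise(records)
by_family = split_by(records, "family")
```

Each group returned by `split_by` could then be written to its own file, which mirrors the idea of exporting records by family, country or collection.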

The future

plantR is a long-term project. We will continue to improve the maps, gazetteers and databases provided with the package and to add tutorials in different languages (i.e., English, Portuguese, Spanish and French), both to broaden the audience of possible users and to show how they can make the most of its tools. We hope that this new package can have a positive impact on how we assess and monitor global biodiversity.

To read the full Methods in Ecology and Evolution article, click on the following link: “plantR: An R package and workflow for managing species records from biological collections”. For a detailed introduction, check the package tutorial here. The full details on the implementation of plantR can be found on the package GitHub here.