Earlier this month Leila Walker attended a panel discussion offering ‘Practical Tips for Reproducible Research’ as part of the Annual Meeting of the Macroecology Special Interest Group (for an overview of the meeting as a whole, check out this Storify). The session and subsequent drinks reception were sponsored by Methods in Ecology and Evolution. Here, Leila reports back on the advice offered by the panel members.
For anyone interested in viewing further resources from the session, please see here. You may also like to consider attending the Best Practice for Code Archiving workshop at the 2016 BES Annual Meeting. Do you have any tips for making your research reproducible? Comment on this post or email us and let us know!
This year’s Annual Meeting of the Macroecology SIG was the biggest yet, with around 75 attendees and even representation across the PhD, post-doc and faculty spectrum. The panel discussion aimed to consider what reproducibility means to different people, identify the reproducibility issues people struggle with, and ultimately provide practical tips and tools for achieving reproducible research. Each panellist delivered a short talk offering their perspective on reproducibility, with plenty of opportunity for discussion during the session itself and at the poster and wine reception that followed.

So… what is reproducible research?
First up was Natalie Cooper, deputy chair of the Macroecology SIG and MEE Associate Editor, who introduced the session and got us thinking about what reproducible research actually is. At its most extreme, reproducible research could be taken to mean the ability to reproduce a paper from start to finish. A rather daunting suggestion, met with nervous chuckles from the audience, and pronounced by Matt Pennell, another member of the panel, to be “impossible”! Before we collectively gave up and went home, however, Natalie reassured us by imparting what for her were the two important take-home messages of the session:
- Reproducibility is hard. Don’t be disheartened when you can’t produce a carbon copy of a complex analysis from start to finish
- Something is better than nothing. You needn’t overhaul your whole working process; just taking on board one or two suggestions is a step in the right direction
Importantly, Natalie also discussed why researchers should concern themselves with reproducible research. The altruistic argument (“It’s the right thing to do!”) is often trumpeted, and while Natalie acknowledged this as an admirable moral standpoint, she offered an easier sell: in most cases it is you who stands to benefit the most. After all, it’s you who will have to re-run all that R code when, many months down the line, the reviewer comments come back. Throw in the implications of basing policy on research that cannot be reproduced, the likely future rise in open-access data, and the simple fact that it is remarkably easy to make a mistake when dealing with complicated data from multiple sources, and the incentives for reproducible research are plain to see.
Write your data for machines and your code for people

Maintaining the ‘every little helps’ ethos was Tom Webb, who began to delve into some of the practicalities of reproducibility. After regaling us with horror stories of bad spreadsheets he had encountered (think colour-coded cells, merged columns and empty rows), he went on to discuss the foundation of reproducibility: tidy data.
Championed by Hadley Wickham, this simple concept proposes that in a tidy dataset each variable is a column, each observation is a row, and each type of observational unit forms a table. A common barrier to achieving this, Tom explained, is the desire to cram extra information into your dataset. A simple solution: consign all metadata to the first worksheet.
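To make the idea concrete, here is a minimal sketch in R (the species names and counts are invented for illustration) of turning an untidy, one-column-per-year table into a tidy one with tidyr:

```r
library(tidyr)

# Untidy: one column per year, so the variable 'year' is trapped in the headers
counts <- data.frame(
  species = c("Abies alba", "Betula pendula"),
  `2014`  = c(12, 7),
  `2015`  = c(15, 9),
  check.names = FALSE
)

# Tidy: each variable (species, year, count) is a column,
# and each row is a single observation
tidy_counts <- gather(counts, key = "year", value = "count", -species)
tidy_counts
#          species year count
# 1     Abies alba 2014    12
# 2 Betula pendula 2014     7
# 3     Abies alba 2015    15
# 4 Betula pendula 2015     9
```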
As someone who has spent a lot of time compiling datasets from different sources, Tom suggested that others in a similar position might consider providing their collaborators with a pre-structured dataframe and encouraging them to populate it. Comma Chameleon, a desktop CSV editor, is a tool that might help in this regard.
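As a rough sketch of what such a template might look like (the column names here are entirely hypothetical, not Tom’s), you could define an empty dataframe with the required structure and circulate it as a CSV:

```r
# A hypothetical pre-structured template: fixed column names, no rows yet
template <- data.frame(
  site      = character(0),
  species   = character(0),
  abundance = numeric(0),
  date      = as.Date(character(0))
)

# Collaborators fill this in, rather than each inventing their own layout
write.csv(template, "data_template.csv", row.names = FALSE)
```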
Workflows in R
Next up, Laura Graham described how a sensible workflow helps her to achieve reproducibility in R, and shared some of the tools she uses at each step. For Laura, the backbone of any sensible workflow is keeping each project self-contained via an organised folder structure, and annotating your code.
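What ‘self-contained’ looks like will vary, but one common convention (a generic example, not a layout prescribed in the session) is something like:

```
my_project/
  data/       raw data, treated as read-only
  R/          scripts and functions, annotated throughout
  output/     figures and tables, always regenerated by code
  README.md   what the project is and how to re-run it
```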
At the start of the process, a host of packages is available for loading your data into R, including readr (for csv and txt files), readxl (for Excel spreadsheets), RODBC (for a range of databases), RPostgreSQL (for PostgreSQL databases) and googlesheets (for Google Sheets). Again, Laura emphasised the importance of having tidy data, and named a few R packages to help with tidying and manipulating data, such as tidyr, plyr and dplyr.
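By way of illustration, here is a hedged sketch of those first steps (the file name and column names are made up):

```r
library(readr)
library(dplyr)

# Read the raw data; read_csv() returns a tibble and leaves column names untouched
surveys <- read_csv("data/surveys.csv")

# Basic tidying and manipulation with dplyr verbs
surveys_clean <- surveys %>%
  filter(!is.na(abundance)) %>%        # drop incomplete observations
  mutate(log_abund = log1p(abundance)) # add a transformed variable
```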
For the messy output that confronts you once you’ve run an analysis, Laura pointed to broom, a package that summarises the statistical outputs of interest into tidy data frames. When it comes to plotting, ggplot2 is the perfect companion for your tidy data, whilst the mention of cowplot, a useful package for replacing ggplot2’s rather cluttered default background, attracted an appreciative murmur from the audience.
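For instance (using R’s built-in iris data purely as a stand-in), broom turns a model object into a tidy data frame, and cowplot’s theme strips away the default grey background:

```r
library(broom)
library(ggplot2)
library(cowplot)

# A throwaway model on built-in data
fit <- lm(Sepal.Length ~ Sepal.Width, data = iris)

tidy(fit)    # one row per coefficient: estimate, std.error, statistic, p.value
glance(fit)  # one-row summary of the whole model: r.squared, AIC, ...

# Tidy data slots straight into ggplot2; theme_cowplot() gives the cleaner look
ggplot(iris, aes(Sepal.Width, Sepal.Length)) +
  geom_point() +
  theme_cowplot()
```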
The final step in Laura’s workflow involves RMarkdown, an authoring format that lets you embed R code, and its output, directly in the document you are writing. With lots of help available online, Laura recommends it as a great way of keeping a dynamic log of your analyses. Overleaf and Authorea were also mentioned as tools for sharing work with collaborators and, given an earlier comment from Matt Pennell that “co-authors are the worst!”, may well prove a helpful aid for some.
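A minimal RMarkdown document looks something like the sketch below; when rendered, the code in the chunk runs and its output is embedded in the final document, so reported numbers always match the analysis that produced them:

````
---
title: "Analysis log"
output: html_document
---

The mean sepal length, computed from the data at render time:

```{r}
mean(iris$Sepal.Length)
```
````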
Making data open and remaining relevant
Amy Zanne, a Journal of Ecology Associate Editor, moved us on to the topic of open data and began by considering the most commonly heard arguments for and against making your data openly available. As someone who has always strived to make her data open (originally in appendices and now in Dryad), Amy cited furthering science, forging new collaborations and having others evaluate your work as amongst the reasons why you should.
“If you want to be relevant in 10 years’ time, share your data!” – Amy Zanne
Although she does not agree with them, Amy acknowledged the arguments of those opposed to the concept: the effort involved (both the hard work already invested in collecting the data, and the extra work of making it available), a fear of being scooped, and a belief that open data disproportionately hinders Early Career Researchers (ECRs).
Amy, though, urged ECRs to view open data as an opportunity rather than a hurdle to overcome, and ultimately her message to all was: if you want to be relevant in 10 years’ time, share your data! Data archiving is a fast-moving field, with an increasing number of journals requiring it and the number of open repositories growing. If you fail to get involved, you may well get left behind.
The curve of diminishing returns
The penultimate speaker, Matt Pennell, focused on finding the optimal balance between effort required and benefit gained when it comes to reproducible research. To demonstrate his point he compared 2014 Matt with 2016 Matt, the 2016 model achieving only marginally less ‘reproducibility’ than the 2014 version despite investing considerably less effort.
Having previously gone to great pains to make his research reproducible (see a 2014 piece by Matt and colleagues here), Matt has distilled his approach down to two essential elements: annotation and version control. For the latter, GitHub is a tool worth exploring. Matt asserts that your goal should be for others to be able to run your analyses when provided with your data, your code and some additional help from you. Expecting others to run your analyses given only the data and code is probably a step too far, and not worth the effort required.

Where next for reproducible research?
Wrapping proceedings up was Nick Isaac, Associate Editor at Methods in Ecology and Evolution, who urged us to consider the different motives behind reproducible research and their respective implications. As a researcher your primary motivation may simply be to make your life easier, but other parties may be motivated by data access (your contemporaries), value for money (funders), or fear of retraction (journals and publishers). These various motivations have implications for the form data is expected to take, with journals potentially being more interested in derived data and researchers in raw data. Reaching some consensus on this, argued Nick, should be an important goal for the scientific community.
Nick closed by calling on the scientific community to get a better handle on what we expect of reproducibility, and on all of us to help improve practice incrementally across the whole community.
Interested in these issues? Remember to check out the Best Practice for Code Archiving workshop at this year’s annual meeting.