Post provided by David Bapst

To celebrate the 10th Anniversary of the launch of Methods in Ecology and Evolution, we are highlighting an article from each volume to feature on the blog. For Volume 3, we have selected ‘paleotree: an R package for paleontological and phylogenetic analyses of evolution’ by David W. Bapst (2012). In this post, David discusses the background to the Application he wrote as a graduate student, and how the field has changed since.

I was a fourth year graduate student when I first had the idea to make an R package. Quite a few people thought it was a bit silly, or a bit of a time-waste, but I thought it was the right thing to do at the time, and I think it has proven to be the right decision in hindsight.

The story behind paleotree

My dissertation was on the approaches available for dating phylogenies of fossil taxa when character data wasn’t available. At the time, it was very common to see papers dating trees by simply taking the age of the oldest appearing taxon in that clade, and assigning that date as the node age for that group.

This was usually because we wanted to do macroevolutionary analyses with phylogenetic comparative methods, which means we needed dated trees. Of course, phylogeny and stratigraphic order don’t always match, so when the oldest appearing taxon in a group was also very nested, a whole bunch of nodes would be shoved together, as if they had split simultaneously, in the same instant of time. So, sometimes, users would add a little extra time (say, a million years) to space out the nodes further. For big trees, however, this approach sometimes moved the root age backwards, tens of millions of years earlier. There were a few other approaches as well, all of which each person had to code themselves or find a script somewhere. Even worse, it was often unclear what people were actually doing: some papers would say they did one thing in the text, but their code showed that they did something else.
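As a rough illustration of the two approaches just described, here is a toy sketch in Python. It is not paleotree's code, and the tree, the taxon names and the ages are all made up for the example. Each node is given the age of the oldest first appearance among its descendants, and an optional minimum branch length (the hypothetical `min_branch` argument) adds a little extra time at every node.

```python
# Hypothetical toy data: first appearances in millions of years ago (larger = older).
first_appearance = {"A": 10.0, "B": 12.0, "C": 15.0, "D": 8.0}

# Nested tuples encode the cladogram ((A,(B,C)),D).
# Taxon C appears earliest in the record but is also deeply nested.
tree = (("A", ("B", "C")), "D")


def node_ages(node, fad, min_branch=0.0, ages=None):
    """Return (age of this node, dict mapping every internal node to its age).

    With min_branch = 0, each node takes the oldest first appearance among
    its descendants. With min_branch > 0, each node is additionally pushed
    back so that it is at least min_branch older than each of its children.
    """
    if ages is None:
        ages = {}
    if isinstance(node, str):
        # A tip is simply placed at its first appearance.
        return fad[node], ages
    child_ages = [node_ages(child, fad, min_branch, ages)[0] for child in node]
    ages[node] = max(child_ages) + min_branch
    return ages[node], ages


# Simplest approach: no extra time added.
root_basic, basic_ages = node_ages(tree, first_appearance)

# Same tree, adding one million years at every node.
root_mbl, mbl_ages = node_ages(tree, first_appearance, min_branch=1.0)

print("basic node ages:", sorted(basic_ages.values()), "root:", root_basic)
print("spaced node ages:", sorted(mbl_ages.values()), "root:", root_mbl)
```

In this made-up example, the simplest approach places all three nodes at 15 million years, because the oldest taxon is nested inside every clade. Adding one million years per node separates them, but the root moves from 15 to 18 million years, and on a larger tree that drift compounds.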

Unfortunately for me, I really wanted to apply comparative methods to the datasets I was going to generate in my dissertation, but the approaches available to date a tree were not well-established at the time. So I decided to try to do something about it, and make better methods. Along the way, I implemented the existing methods and developed approaches to simulate fossil data and their incompleteness. My goal was to explore how ancestor-descendant relationships convert to poorly resolved cladograms, so that I could see what effect the different methods had on our inferences.

All of this, and more, was not available for people to use at the time. So, to me, it seemed worthwhile to create my own R package and make these approaches available. One of my inspirations was Gene Hunt’s paleoTS R package, which implemented the time-series analyses that Gene had developed for testing for stasis and punctuated change in the fossil record.

Writing an R package wasn’t really meant to be a chapter of my dissertation, and even if I thought it was important to write it as a package, some suggested I should wait until after my PhD. So, to make the case for working on a manuscript that wasn’t a chapter, I suggested that I write a short paper accompanying the package and highlighting its use, and argued that this would really help make me stand out after I graduated. I had seen the software reports describing R packages in Methods in Ecology and Evolution, and so I started making my most useful code into functions, and then used package.skeleton (devtools had only been around for half a year on CRAN, and I wouldn’t start using devtools myself until 2013). I then discovered how each of my functions suddenly had an .Rd file, waiting for me to fill with documentation. Not wanting to delay submitting to CRAN before I submitted my manuscript to MEE, I spent a week doing the minimum documentation necessary. At that point, I discovered that programming was hard, but that it was much more time consuming to write detailed documentation that would be understandable to another person.

Hindsight is always 20/20

Looking over the 2012 paleotree paper in MEE now, it’s funny to think about what I decided to showcase.

I have a section on plotting taxonomic diversity over time in a given dataset, something I had spent a lot of work doing, as a way of validating my simulation methods, and yet (to my knowledge) hardly anyone has used it in a publication, not even me. The methods for estimating sampling rates are also highlighted because I thought they would also become widely used (they didn’t, even after I expanded the repertoire of methods available in 2014).

I also showcase the simulation methods a fair bit and, while I know a few published studies that have used my simulation tools (e.g. Quental and Marshall, 2015), I think my overall hope that my simulation tools would become widely used didn’t come to fruition. As it is, they had a number of novelties that aren’t really explained here, so I would need to detail the simulation algorithm in my 2013 PLOS One paper, and explain more details in my 2014 Paleobiology paper. The entire section actually no longer bears any relationship to the current simulation tools in paleotree, as I later worked out an alternative approach to simulating incomplete fossil records under different conditional scenarios that was quicker and fixed a number of issues with my previous approaches.

Most studies which cite my 2012 paleotree paper do so because the authors of those studies used functions in paleotree to date a phylogeny. Many of the uses I am cited for are by people using dating methods I didn’t invent but merely implemented in an R package. The ‘better method’ for dating trees that I was working on (now known as the cal3 method) is mentioned briefly in my 2012 paper (as the ‘src’ method). My later work (Bapst, 2014, Paleobiology; Bapst & Hopkins, 2016; Bapst et al., 2016; Lloyd et al., 2016) would reveal through simulation and application to empirical data that while the previous approaches could produce fairly misleading inferences, my ‘improved’ method was not as improved as I hoped.

The examples in the MEE paper are all on simulated data, rather than an empirical dataset, even for simple methods such as plotting diversity curves. I would realize later that this was making it difficult for new users to figure out how to use paleotree with their own data. I am not sure how I did not realize, back in 2012, that using only simulations in my examples would be really off-putting to others. I definitely saw my folly when I started receiving a number of emails from confused users, asking how to get my simulation functions to ‘read their data’. Instead of using the MEE paper as a guide to using the package, I more often hear that people looked at a tutorial I posted a few years ago on my abandoned blog, involving a real dataset of retiolitid graptolites. (That blog post was originally going to be a formal vignette, but I never got around to it.)

How have things changed since 2012?

While I am a little embarrassed by the aged content of the MEE paper from 2012, the truth is, I never really stopped writing that paper. As a static document, that paper introduced users to paleotree in 2012, and although the usefulness of that static document today is questionable, the paper also serves as an anchor for the package itself, linking the package itself with the academic literature. The living package paleotree continually changes in ways that a standard journal article cannot, as I add new functions, maintain and improve existing functions, and clarify documentation. All of that work, a product of eight years of off-and-on development, can be easily cited by referring to the unchanging, 2012 MEE paper.

With regards to my motivation to improve the transparency and reproducibility of dating analyses for fossil trees, I would say that many of the analyses that use simpler approaches now refer to explicit implementations in paleotree or similar packages (such as strap). Now it is much easier to figure out what most studies are actually implementing.

While the cal3 method of dating is still used, it is instead much more common to see studies apply full fossilized-birth-death analyses (‘FBD’; Heath et al., 2014, PNAS) using MrBayes, BEAST2, or RevBayes. In 2012, I couldn’t imagine that standard suites of Bayesian phylogenetic inference software, like the above, would quickly adopt the FBD models that account for incompleteness in the fossil record, nor implement the ancestor-move (Gavryushkina et al., 2014, PLOS Computational Biology), allowing the MCMC to consider that some sampled tip taxa on a phylogeny might be direct ancestors of other tip taxa. In Bapst & Hopkins (2016), my co-author and I argued that cal3 had actually become old hat, such that workers interested in cal3 should consider FBD analyses from standard inference software instead. Although this means the process itself cannot run within R, newer functions in paleotree exist for prepping the input files needed for doing FBD tip-dating analyses in MrBayes.

If developing paleotree and submitting to MEE has changed anything in this world at all, then it has changed me. Being the author and maintainer of a mildly-worn R package for the last eight years has been an extremely educational experience. As a paleontologist, I already had all the benefits and penalties of someone sitting at the intersection of earth science and evolutionary biology, but speaking with the broad base of paleotree users made me realize just how different perspectives could be within my own field, with the micropaleontologists, the invertebrate paleontologists, and the vertebrate paleontologists who contacted me for assistance sometimes having inverted senses of what terms common to our field were even supposed to mean, such as ‘first appearance’ and ‘last appearance’. In a sense, it was almost as if each paleontologist, typically trained to a particular taxonomic group, started from a default assumption that everyone else’s fossil record was just like the one they studied. This meant stumbling blocks for one group were not the same for other groups, and so the code and the documentation had to evolve to be understandable no matter what starting point a reader began from.

This meant I had to divorce myself from my own preconceptions when writing documentation for paleotree. Rather than assume that the words I used, like ‘taxon’, ‘ancestor’, and ‘speciation’, meant what they had meant to me, I had to instead carefully avoid words that people had deeply held but divergent preconceptions about, and find more general words to express what a particular function did. I had been told as a graduate student that I should always write scientific papers as if the best, most knowledgeable person I could imagine was reading my study. For writing documentation, I flipped this on its head, and instead imagined myself when I was a beginning graduate student, with a head full of confused paradoxes and wrong assumptions, and an inability to translate warning and error messages from R into plain English. I took this perspective deeply to heart, that functions and documentation should be written as if the average user is likely the most confused and (perhaps) misled among us.

If we aren’t writing our software to be used safely and wisely by first-year graduate students and undergrads, who do we think we are writing software for? I would say I have saved myself numerous times already by taking this approach, often stumbling on documentation from paleotree that carefully dissects some issue of practicality that I had entirely forgotten ever writing documentation about. So, perhaps the most confused and misled of all of us is simply ourselves, eight years later.

To find out more about dating phylogenies containing fossil taxa, read the Methods in Ecology and Evolution article, ‘paleotree: an R package for paleontological and phylogenetic analyses of evolution’ and check out the R package paleotree, available from CRAN.

Read our editors’ favourite evolution articles: 10th Anniversary Vol. 3: Editor’s Choice