Post provided by Daniel Vedder, Markus Ankenbrand, and Juliano Sarmento Cabral
Five years ago, a new institute opened its doors at the University of Würzburg: the Center for Computational and Theoretical Biology (CCTB). The idea was simple. Take six computational research groups, covering topics from image analysis to genomics and ecological modelling, put them in a building together, and see what happens.
Despite our disparate areas of expertise, this “experiment” has worked really well. It soon turned out that one of our greatest strengths as an institute lay in the computing know-how we brought with us or have acquired together. In our experience, many biologists are still somewhat wary of computational techniques, and struggle with them even when they do use them. Part of the reason for this unease, we believe, is that few biologists are thoroughly trained in computer science.
Our big advantage was that we had people who had this training, either from a formal computer science degree, or from years of research or even hobby programming experience. Building on this foundation, we were able to invest heavily in further training for all of us. We offered programming courses for students, as well as more advanced software development workshops for our active researchers. As a result, we have gained greater confidence and higher productivity in our everyday research work.
The situation in ecological modelling
Two of us (Juliano and Daniel) are members of the Ecosystem Modelling Group at the CCTB. For our group, improving our general software development skills has been particularly important. Because we work with individual-based models (IBMs), we write software not only to analyse data, but to generate it as well. Therefore, our level of programming expertise has a major influence on the reliability of our research.
Of course, this challenge is not particular to our little group. IBMs have become increasingly popular over the past three decades, and are now widely used in ecological and evolutionary research. They are fantastic tools, enabling new questions to be asked and experiments to be conducted that would be unthinkable without them. Unlike statistical models, they allow a mechanistic rather than a purely correlative investigation of ecological phenomena. And unlike analytical mathematical models, they can deal with highly complex systems that are much more representative of real ecosystems.
Not surprisingly, this complexity is a double-edged sword. Analysing the data output of a large IBM can be almost as difficult as analysing real-life ecological data. Also: how reliable is a model where you don’t fully understand how the different parts interact? Accordingly, the ecological modelling community has developed a whole range of methods to deal with this issue of biological complexity. These include the ODD and TRACE protocols, pattern-oriented modelling, “evaludation”, and others.
However (and this is important), IBMs are not only biologically complex. Rather, they display a double complexity. IBMs simulate complex ecological systems using computer code (software) that is itself a complex system. This is something that has received much less attention among ecological modellers.
In computer science, it is well recognised that managing the complexity of a code base is one of the most important jobs any software developer has. Large programs quickly become impossible to completely understand, giving rise to all sorts of unexpected behaviours and bugs. For decades, computer scientists have therefore thought about how to reduce the problems this technical complexity entails.
Unfortunately, many ecological modellers are unfamiliar with the computer science literature and unaware of the techniques that have been developed to this end. This is a major problem for our research field, because of the way we are dependent on our self-written code. If we want to produce reliable scientific models, we need to write reliable code – and to do so, we have to learn how to deal with technical complexity as well as with biological complexity.
What we have learnt
Our paper grew from this realisation. Our aim was firstly to get ecological modellers to think about the double complexity of IBMs, and secondly to introduce them to ways of dealing with the technical complexity of their models. To do so, we identified four general strategies that can be used to manage complexity: avoiding, subdividing, documenting, and reviewing. We then presented multiple techniques for each of these strategies. Throughout, we combined theory from the computer science literature with illustrative examples of open-source ecological IBMs.
For example, one very simple way of avoiding unnecessary complexity is to write clean code. Using clear formatting, suitable variable names, and adequate comments reduces the cognitive cost of understanding a piece of code. Code that is easy to read is easy to understand, and easier to fix and modify than code that is overly dense and obscure.
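As a minimal sketch (in Python, with hypothetical function and variable names chosen for this post), here is what that difference can look like in practice:

```python
# Readable version: descriptive names and a docstring make the
# intent of this discrete logistic growth step obvious at a glance.
def logistic_growth(population: float, growth_rate: float, capacity: float) -> float:
    """Return the population size after one discrete logistic growth step."""
    return population + growth_rate * population * (1 - population / capacity)

# An equivalent but obscure version that forces the reader to decode it:
# def f(p, r, k): return p + r * p * (1 - p / k)
```

Both functions compute exactly the same thing; only the first one can be checked against the biology it is meant to represent without extra effort.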
Similarly, the choice of programming language can have a major impact on the ease of creating and maintaining large modelling software, so this should be given sufficient thought before starting a project.
To subdivide complexity, models can be split into different modules that each take care of different tasks. This helps developers to focus on one section of the code at a time, rather than keeping track of the whole program.
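To illustrate (a hypothetical, deliberately simplified IBM sketched in Python), each ecological process can live in its own function, so the main loop only orchestrates them and each piece can be read and tested in isolation:

```python
import random

def disperse(positions, max_step=1.0):
    """Dispersal module: move each individual a random distance along one axis."""
    return [x + random.uniform(-max_step, max_step) for x in positions]

def reproduce(positions, birth_prob=0.3):
    """Reproduction module: each individual leaves one offspring with
    probability birth_prob, placed at the parent's position."""
    offspring = [x for x in positions if random.random() < birth_prob]
    return positions + offspring

def simulate(n_start=10, n_steps=5, seed=42):
    """Main loop: only sequences the process modules, no process logic here."""
    random.seed(seed)
    positions = [0.0] * n_start
    for _ in range(n_steps):
        positions = disperse(positions)
        positions = reproduce(positions)
    return positions
```

In a real model each module would be far larger, but the principle is the same: a developer debugging dispersal never needs to hold the reproduction code in their head at the same time.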
Reviewing complex software is important because there will always be programming errors (bugs) that the initial developer misses. To catch these, a team of developers can review each other’s code, or build up an automated test suite that runs at regular intervals.
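Such tests can be very small. The sketch below (hypothetical example functions; in a real project these would live in a test suite run automatically by a tool such as pytest) checks two properties a logistic growth rule must satisfy:

```python
def logistic_growth(population, growth_rate, capacity):
    """Discrete logistic growth step (example function under test)."""
    return population + growth_rate * population * (1 - population / capacity)

def test_no_growth_at_capacity():
    # At carrying capacity, the population should not change.
    assert logistic_growth(500, 0.1, 500) == 500

def test_growth_below_capacity():
    # Below capacity with a positive rate, the population should increase.
    assert logistic_growth(100, 0.1, 500) > 100

test_no_growth_at_capacity()
test_growth_below_capacity()
```

Each test encodes an ecological expectation, so a later code change that silently breaks the model’s behaviour is caught immediately rather than discovered in the published results.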
Finally, documentation is important to bring new users and collaborators onboard, but also to help the core developers orient themselves once the code base is too large to remember everything at once.
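Documentation does not have to mean a separate manual: much of it can live in the code itself. A brief sketch (hypothetical module and function names):

```python
"""grid.py - spatial grid utilities for the model.

Coordinates are (row, column) pairs. The grid wraps around at the
edges (torus topology), so every cell has a full set of neighbours.
"""

def wrap(index: int, size: int) -> int:
    """Map an out-of-range index back onto a grid axis of the given size.

    Assumes size > 0. For example, wrap(-1, 10) == 9 and wrap(10, 10) == 0.
    """
    return index % size
```

The module docstring orients a newcomer in seconds, and the function docstring records an assumption (torus wrapping) that even the original author will eventually forget.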
Where to go from here
As we wrote above, ecological modellers need to think more about good software development practice. We need cross-pollination from the computer science literature, and would profit from collaborations with computer scientists. This is especially important for small research groups (like ours at the CCTB), which do not have the funds to hire professional software developers. All of us can learn from projects such as Software Carpentry and rOpenSci, which are working to increase computer literacy among scientists more generally.
“An ecologist and a programmer walk into a bar” – quite frankly, that is just what we need. As ecological modellers (and computational biologists in general), we need to look beyond the boundaries of our discipline and learn from those who developed the tools we use. Too often, the scientific potential and reliability of our work is limited by our computational skills. But that can change, because the knowledge and the skills we need are readily available, if only we’d look for them more actively. If you walked into a bar with a programmer, what would you ask?
About the Authors
Daniel Vedder joined the CCTB as a first-year biology student and has recently completed his master’s thesis in the Ecosystem Modelling Group. He has been an avid hobby programmer since high school and regularly gives workshops on software development topics.
Markus Ankenbrand earned a bachelor’s degree in computer science parallel to studying for his bachelor and master of biology. He did his PhD at the CCTB, writing a tool for ecological community analysis, and now works there as a postdoctoral researcher. He is an instructor for the Software Carpentry community.
Juliano Sarmento Cabral is a junior professor and group leader of the Ecosystem Modelling Group at the CCTB. He started working with individual-based models for his PhD and has been involved in numerous IBM projects ever since.
To read the full Methods in Ecology and Evolution article, “Dealing with Software Complexity in Individual-Based Models”, visit the journal website here.