Post provided by Alex Chan Hoi Hang, PhD student, Centre for the Advanced Study of Collective Behaviour, University of Konstanz
The story of this project traces back to 2019, when I was a second-year undergraduate in biological sciences at Imperial College London, UK, taking an animal behaviour course. For one of the hands-on sessions, Dr. Julia Schroeder (who later became my undergraduate and masters project supervisor) walked into the room and gave us multiple 1.5-hour-long videos of house sparrows visiting a nest box. Our task was simple: download VLC media player, watch the videos at 4x speed, then mark down every time a sparrow entered or exited a nest box. That was the first time I experienced coding behaviour from videos: after a few hours you start getting fatigued, your eyes water, and you get scared that you might have missed an event because you blinked. But then you realize this is the bread and butter for behavioural ecologists: researchers take out a camera, film videos of animals, then manually watch them afterwards to code for behaviours of interest. In my opinion, this is what hands-on sessions should be: giving us an opportunity to really experience how research is conducted.

Fast forward two years, and I started a masters programme on computational methods in ecology and evolution, still at Imperial College London, UK. There, co-supervised by Dr. Julia Schroeder and Dr. Will Pearse, I took on the challenge of automating the annotation of parental visits in the sparrow videos using computer vision. Eight grueling months of coding later (mostly in my dorm room due to COVID), none of my attempts to fully automate the annotation had worked. I did manage to significantly cut down annotation time by trimming the 1.5-hour videos into short clips to be reviewed by human annotators, and later published the results (Chan et al., 2023). While I was proud that my masters project was published, deep down I knew the job was not done: I still hadn't managed to automate the whole pipeline, and I knew it was possible.
I then moved on to do a PhD at the Centre for the Advanced Study of Collective Behaviour, University of Konstanz, Germany, with Dr. Fumihiro Kano, developing computer vision tools for animal behaviour, mainly focused on 3D posture estimation in birds. One day, I looked across my desk at the postdoc sitting next to me, Dr. Prasetia Putra, while she was automating the annotation of human eating videos. Prasetia, with a background in computer engineering, applied a simple object detection model called YOLO (a model that detects objects in an image and predicts a box around each one) to her videos. Instead of training the model to identify objects, she trained it to identify eating events, which on an image just means detecting when the hand is touching the mouth. I was blown away when I saw it: the method was so simple yet so effective! At that very moment, I knew this method would work on the house sparrow nest box videos that I had struggled with during my masters.
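For readers curious what this looks like in code, below is a minimal sketch of the idea, assuming the Ultralytics YOLO package; the dataset config file and class names are hypothetical placeholders rather than the exact setup from the paper. The only conceptual twist is that the annotated classes are behavioural events (e.g. "hand touching mouth") instead of object categories.

```python
# Minimal sketch (assumed Ultralytics YOLO API; dataset paths and classes are hypothetical).
# The trick: annotate behavioural events as detection classes instead of object categories.
from ultralytics import YOLO

# Start from pretrained weights and fine-tune on frames where behaviours of interest
# (e.g. "eating", i.e. hand touching mouth) are drawn as bounding boxes.
model = YOLO("yolov8n.pt")
model.train(
    data="behaviour_events.yaml",  # hypothetical dataset config listing images and classes
    epochs=100,
    imgsz=640,
)
```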
The rest is history. I first made sure Prasetia was fine with me trying the method to quantify animal behaviour, as I instantly knew how much blood, sweat, and tears it might save researchers trying to code thousands of behavioural ecology videos. And of course, the first thing I did was to try YOLO on the sparrow videos, and it worked beautifully. This was followed by an excited email to my former masters supervisors, Julia and Will, titled “I did it :)”. After collating and testing a few more datasets, I showcased the robustness of the method across five case studies: quantifying parental visits in sparrows, eating in Siberian Jays and humans, courting and feeding in pigeons, and moving and browsing in zebras and giraffes. The method worked great: models were easy to train, and annotation didn't take too much time.
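To illustrate how such a detector could turn raw video into event times, here is a second sketch that runs a trained model frame by frame and records when a "visit" class is detected. The weights file, class name, confidence threshold, and frame rate are assumptions for the example, not the published pipeline.

```python
# Sketch: convert per-frame detections into event timestamps (assumed names and paths).
from ultralytics import YOLO

model = YOLO("sparrow_visits.pt")  # hypothetical fine-tuned weights
CONF_THRESHOLD = 0.5               # assumed confidence cutoff

visit_frames = []
# stream=True yields one Results object per video frame
for frame_idx, result in enumerate(model.predict("nestbox_video.mp4", stream=True)):
    for box in result.boxes:
        cls_name = model.names[int(box.cls)]
        if cls_name == "visit" and float(box.conf) >= CONF_THRESHOLD:
            visit_frames.append(frame_idx)

# Convert frame indices to seconds (assuming a 25 fps recording)
FPS = 25
visit_times = [f / FPS for f in visit_frames]
print(f"Detected {len(visit_times)} frames containing a visit event")
```

In practice, consecutive detected frames would still need to be merged into discrete visit events, but the core point stands: once the detector is trained, the hours of manual scrubbing through footage reduce to a single pass over the video.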
With the framework now published in Methods in Ecology and Evolution, I look forward to seeing how effective it can be for different systems. I tried my best to make the documentation as detailed as possible, so biologists can readily replicate this. While YOLO models may solve a lot of computer vision problems, they may not be the magic solution for all of them. In particular, automatically tracking individual identities is still a largely unsolved problem, and without knowing which animal is doing each behaviour, there is often little point in automating behavioural coding. Hopefully many of these problems will be solved in the coming years, ushering in a new age where most video annotation can be automated with computer vision, so we will never need to manually code videos again.
If you would like to try out the method, check out the code and documentation! And of course, go check out the paper!
References:
Chan, A. H. H., Liu, J., Burke, T., Pearse, W. D.*, & Schroeder, J.* (2023). Comparison of manual, machine learning, and hybrid methods for video annotation to extract parental care data. Journal of Avian Biology, e03167.