Accelerometers, Ground Truthing, and Supervised Learning
Accelerometers are sensitive to movement and the lack of it, but they are not sentient: recognising animal behaviour from their output ultimately depends on human cognition. Remote recognition of behaviour using accelerometers therefore requires ground-truth data based on human observation or knowledge. The need for validated behavioural information, and for automating the analysis of the vast amounts of data collected today, has led many studies to opt for supervised machine learning approaches.
In such approaches, the process of ground truthing involves time-synchronising acceleration signals with simultaneously recorded video, having an animal behaviour expert create an ethogram, and then annotating the video according to this ethogram. This links the recorded acceleration signal to the stream of observed animal behaviours that produced it. The acceleration signals are then divided into finite segments of pre-set duration (e.g. two seconds), called windows. From the acceleration data within each window, quantities called ‘features’ are engineered with the aim of summarising characteristics of the acceleration signal. Typically, ~15-20 features are computed. Good features will have similar values for the same behaviour, and different values for different behaviours.
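The windowing and feature-engineering steps above can be sketched as follows. This is a minimal illustration, not any particular study's pipeline: the sampling rate, window length, and the four summary features per axis (mean, standard deviation, minimum, maximum) are all assumptions chosen for brevity.

```python
import numpy as np

def window_signal(acc, fs, window_s=2.0):
    """Split a (n_samples, 3) tri-axial acceleration array into
    non-overlapping windows of window_s seconds at sampling rate fs (Hz)."""
    win = int(fs * window_s)
    n = (len(acc) // win) * win          # drop the incomplete trailing window
    return acc[:n].reshape(-1, win, acc.shape[1])

def extract_features(windows):
    """Compute simple per-axis summary features for each window:
    mean, std, min and max -> 12 features for 3 axes."""
    feats = [windows.mean(axis=1),
             windows.std(axis=1),
             windows.min(axis=1),
             windows.max(axis=1)]
    return np.hstack(feats)

# Example: 10 s of synthetic tri-axial data sampled at 25 Hz
rng = np.random.default_rng(0)
acc = rng.normal(size=(250, 3))
w = window_signal(acc, fs=25, window_s=2.0)   # five 2-second windows
X = extract_features(w)                       # one feature row per window
```

Each row of `X` would then be paired with the behaviour label that the human annotator assigned to that window, giving the labelled examples a supervised classifier is trained on.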
Modelling species distributions involves relating a set of species occurrences to relevant environmental variables. An important step in this process is assessing how well your model predicts where your target species occurs. We generally do this by evaluating the predictions made for a set of locations that aren’t included in the model fitting process (the ‘testing points’).
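To make the evaluation step concrete, here is a small sketch of scoring predictions at held-out testing points. The predicted probabilities and observed presences are made-up numbers, and the metric shown (AUC via the rank-sum formulation) is just one common choice, not the only one.

```python
import numpy as np

# Hypothetical predicted occurrence probabilities at 8 held-out test
# locations, and the observed presence (1) / absence (0) at each.
pred = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1])
obs  = np.array([1,   1,   0,   1,   0,   1,   0,   0])

# AUC via the rank-sum (Mann-Whitney) formulation: the probability that
# a randomly chosen presence scores higher than a randomly chosen absence.
pos, neg = pred[obs == 1], pred[obs == 0]
auc = (pos[:, None] > neg[None, :]).mean()   # -> 0.8125 for these numbers
```

A model that ranked every presence above every absence would score 1.0; one no better than chance would score around 0.5.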
Random splitting of the species occurrence data into training and testing points
The standard practical advice is that, for reliable validation, the testing points should be independent of the points used to train the model. But truly independent data are often not available. Instead, modellers usually split their data into a training set (for model fitting) and a testing set (for model validation), and this can be done to produce multiple splits (e.g. for cross-validation). The splitting is typically done randomly, so testing points sometimes end up located close to training points. You can see this in the figure to the right: the testing points are in red and the training points are in blue. But could this cause any problems?
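The random split described above can be sketched as follows. The coordinates here are synthetic and the 75/25 split ratio is an assumption; the nearest-neighbour distance at the end simply quantifies how close each testing point ends up to its nearest training point, which is the crux of the concern.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical species occurrence coordinates (longitude, latitude)
coords = rng.uniform(0, 10, size=(100, 2))

# Random 75/25 split into training and testing points
idx = rng.permutation(len(coords))
n_train = int(0.75 * len(coords))
train, test = coords[idx[:n_train]], coords[idx[n_train:]]

# Because the split is random, some testing points may lie very close to
# training points; the nearest-neighbour distance shows how close.
d = np.sqrt(((test[:, None, :] - train[None, :, :]) ** 2).sum(axis=-1))
nearest = d.min(axis=1)   # distance from each test point to its
                          # closest training point
```

Small values in `nearest` flag testing points that are near-duplicates of training points, which is exactly the situation the random split produces.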