Stuck between Zero and One: Modelling Non-Count Proportions with Beta and Dirichlet Regression

Post provided by JAMES WEEDON & BOB DOUMA

Chinese translation provided by Zishen Wang

This blog post is also available in Chinese

Proportion of leaf damage is a type of measurement that can lead to proportional data.

Imagine the scene: you’re presenting your exciting research results at an important international conference. Being conscientious and aware of statistical best practice, you’ve included test statistics and confidence intervals on all your result figures. Not just P values! Some of the data you are presenting involves the proportion of leaf surface damaged by an insect herbivore under different treatments. You finish your presentation (on time!) and there’s time for questions. From the audience a polite but insistent colleague asks: “Your confidence interval for that estimate goes from -0.3 to 0.5… how should we interpret a negative proportion of a leaf?”

Someone chuckles. As you nervously flick back to the slide in question, you mutter something about the difference between confidence intervals and point estimates. You start to feel dizzy. A murmur of confused voices slowly builds amongst the audience members. In the distance, a dog barks.

How can you avoid this?

Proportional Data in Ecology and Evolution

Many kinds of quantities that ecologists and evolutionary biologists routinely measure are most conveniently expressed as proportions. In many cases these proportions are derived from counts. The data are based on discrete entities that can be assigned to two or more classes: success or failure, male or female, invasive or non-invasive. In other cases the proportions are derived from continuous measurements: the proportion of time an animal spends on different activities; percent cover of a plant functional type in a vegetation survey quadrat; allocation of total plant biomass to different organs and tissues. What these data types have in common is that they can only take values between zero and one. Negative values, or values greater than one, don’t make any sense.

The Game of 0 and 1: Modelling Non-Count Proportions with Beta and Dirichlet Regression

Post provided by JAMES WEEDON & BOB DOUMA

Chinese translation provided by Zishen Wang (王子申)

This post is also available in English

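Imagine this scene: you are presenting an exciting result at an important international conference. True to your rigorous attitude towards statistical theory and methods, you have run statistical tests on all of your data and reported confidence intervals. These results contain more than just P values! Some of the data you present concern the proportion of leaf area damaged by herbivorous insects under different treatments. When you finish your talk on time, a colleague asks: “Your confidence interval for the estimated proportion of damage runs from -0.3 to 0.5. How should we interpret a negative leaf area?”

Someone in the audience laughs. Red-faced, you flick back to the slide in question and mumble an explanation of the difference between confidence intervals and point estimates. The audience starts to whisper, and you think you hear a dog barking not far away.

How can you avoid this embarrassing and confusing situation?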

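Proportional Data in Ecology and Evolution

Ecologists and evolutionary biologists routinely measure many quantities that, for convenience, they express as proportions. In many cases these proportions are derived from counts: the data are based on discrete entities that can be assigned to two or more classes, such as success or failure, male or female, invasive or non-invasive. Proportions can also come from continuous variables: the proportion of time an animal spends on different activities; the percent cover of a plant functional type in a vegetation survey plot; the allocation of plant biomass to different organs and tissues. What these proportional data have in common is that they can only take values between 0 and 1. Values below 0 or above 1 make no sense.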

Two measurements that can yield proportional data: the proportion of leaf damage and percent vegetation cover.

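Analysing this kind of data with conventional statistical tools can cause problems. Methods such as linear regression and analysis of variance assume that the response variable can be modelled with a normal distribution. The normal distribution covers values from negative infinity to positive infinity, so it is poorly suited to modelling proportional data. Predictions and confidence intervals derived from a normal distribution may well include values outside the range in which proportions are defined. In addition, the residuals are strongly correlated with the predicted values. All of these symptoms indicate that choosing the wrong model leads to inaccurate statistical inference.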
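
To make the problem concrete, here is a minimal Python sketch using only the standard library. The damage values are invented for illustration, and the helper names (`normal_ci`, `logit_ci`) are our own. It compares a normal-approximation confidence interval computed directly on the proportions with one computed on the logit scale and back-transformed. This is not beta regression, just a quick way to see the boundary problem. It also assumes every value lies strictly between 0 and 1, because the logit is undefined at exactly 0 or 1.

```python
import math
from statistics import NormalDist, mean, stdev

# Hypothetical leaf-damage proportions (made-up values, for illustration only).
damage = [0.01, 0.02, 0.02, 0.03, 0.70, 0.02, 0.01, 0.05]


def normal_ci(values, level=0.95):
    """Normal-approximation confidence interval for the mean of `values`."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    centre = mean(values)
    se = stdev(values) / math.sqrt(len(values))
    return centre - z * se, centre + z * se


def logit(p):
    """Map a proportion in (0, 1) onto the whole real line."""
    return math.log(p / (1 - p))


def inv_logit(x):
    """Map any real number back into (0, 1)."""
    return 1 / (1 + math.exp(-x))


def logit_ci(values, level=0.95):
    """Interval computed on the logit scale, then back-transformed."""
    low, high = normal_ci([logit(v) for v in values], level)
    return inv_logit(low), inv_logit(high)


raw_low, raw_high = normal_ci(damage)
bounded_low, bounded_high = logit_ci(damage)
print(f"raw scale:   {raw_low:.3f} to {raw_high:.3f}")
print(f"logit scale: {bounded_low:.3f} to {bounded_high:.3f}")
```

With these values the raw-scale interval dips below zero, while the back-transformed interval cannot leave (0, 1). Note that the two intervals describe different central values: the second is centred on the back-transformed mean of the logits, not on the arithmetic mean.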

Conservation or Construction? Deciding Waterbird Hotspots

Below is a press release about the Methods in Ecology and Evolution article ‘A comparative analysis of common methods to identify waterbird hotspots’ taken from Michigan State University.

A mixed flock of waterbirds on the shore of Lake St. Clair. ©Michigan DNR

Imagine your favourite beach filled with thousands of ducks and gulls. Now envision coming back a week later and finding condos being constructed on that spot. This many ducks in one place surely should indicate this spot is exceptionally good for birds and must be protected from development, right?

It depends, say Michigan State University researchers.

In a new paper published in Methods in Ecology and Evolution, scientists show that conservation and construction decisions should rely on multiple approaches to determine waterbird “hotspots,” not just on one analysis method as is often done.

Mosquitoes, Climate Change and Disease Transmission: How the Suitability Index P Can Help Improve Public Health and Contribute to Education

Post Provided by JOSÉ LOURENÇO

This blog post is also available in Portuguese

©BARILLET-PORTAL David

Vector-borne viruses (like those transmitted by mosquitoes) are (re)emerging and they’re hurting local economies and public health. Some typical examples are the West Nile, Zika, dengue, chikungunya and yellow fever viruses. The eco-evolutionary and epidemiological histories of these viruses differ massively. But they share one important factor: their transmission potential is highly dependent on the underlying mosquito population dynamics.

An ultimate challenge in infectious disease control is to prevent the start of an outbreak or alter the course of an ongoing outbreak. To achieve this, understanding the ecological, demographic and epidemiological factors driving a pathogen’s transmission success is essential. Without this information, public health planning is immensely difficult. To get this information, dynamic mathematical models of pathogen transmission have been successfully applied since the mid-20th century (e.g. for malaria and dengue).

Mosquitoes, Climate and Pathogen Transmission: How the Index P Can Contribute to Public Health and Education

Blog post provided by JOSÉ LOURENÇO

This blog post is also available in English

©BARILLET-PORTAL David

Vector-borne viruses (e.g. those carried by mosquitoes or ticks) are (re)emerging, with negative consequences for public health and local economies. Typical recent examples of mosquito-borne viruses include West Nile virus in North America, Israel and Europe, and the Zika, dengue, chikungunya, Mayaro and yellow fever viruses in South America and Africa. The epidemiology, ecology and evolution of these viruses are highly diverse, but they all share one critical factor: their transmission potential depends heavily on the population dynamics of the mosquito species involved.

One of the main goals of infectious disease control is to prevent the start (or alter the course) of epidemics. To that end, dynamic transmission models have been used successfully since the mid-20th century (e.g. for malaria). These models are computational approximations of real biological systems. They let us simulate a multitude of scenarios on our personal computers and so test, reconstruct and project the epidemiological potential and behaviour of pathogens. When such simulations are compared with real observations (e.g. the number of cases reported by a surveillance system), the models offer answers about the mechanics of transmission and about the epidemiological or demographic factors that may have contributed to particular patterns seen in the data. Although dynamic models are one of the cornerstones of contemporary epidemiology, imperfect or missing data can make it difficult (if not impossible) to design, implement and make use of them. Data can be imperfect for many reasons, including weak surveillance systems, human error and lack of investment.

Advances in Modelling Demographic Processes: A New Cross-Journal Special Feature

Analysis of datasets collected on marked individuals has spurred the development of statistical methodology to account for imperfect detection. This has relevance beyond the dynamics of marked populations. Two good examples are determining site occupancy and disease infection state.

EURING Meetings

The regular series of EURING-sponsored meetings (which began in 1986) have been key to this development. They’ve brought together biological practitioners, applied modellers and theoretical statisticians to encourage an exchange of ideas, data and methods.

This new cross-journal Special Feature between Methods in Ecology and Evolution and Ecology and Evolution, edited by Rob Robinson and Beth Gardner, brings together a collection of papers from the most recent EURING meeting. That meeting was held in Barcelona, Spain, in 2017, and was hosted by the Museu de Ciències Naturals de Barcelona. Although birds have provided a convenient focus, the methods are applicable to a wide range of taxa, from plants to large mammals.

Spatial Cross-Validation of Species Distribution Models in R: Introducing the blockCV Package

Post provided by Roozbeh Valavi

This post is also available in Persian

Modelling species distributions involves relating a set of species occurrences to relevant environmental variables. An important step in this process is assessing how good your model is at figuring out where your target species is. We generally do this by evaluating the predictions made for a set of locations that aren’t included in the model fitting process (the ‘testing points’).

Random splitting of the species occurrence data into training and testing points

The usual practical advice is that, for reliable validation, the testing points should be independent of the points used to train the model. But truly independent data are often not available. Instead, modellers usually split their data into a training set (for model fitting) and a testing set (for model validation), and this can be done to produce multiple splits (e.g. for cross-validation). The splitting is typically done randomly, so testing points sometimes end up located close to training points. You can see this in the figure to the right: the testing points are in red and training points are in blue. But could this cause a problem?
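
As a rough illustration of how close randomly chosen testing points sit to training points, the Python sketch below simulates occurrence points in a unit square and compares a random split with a single held-out spatial block. It uses only the standard library and does not reproduce the blockCV API. The point pattern, the 20% hold-out and the strip-shaped block are all assumptions made for the example.

```python
import math
import random
from statistics import median


def nearest_train_distance(test_pts, train_pts):
    """Distance from each testing point to its closest training point."""
    return [min(math.dist(t, p) for p in train_pts) for t in test_pts]


def random_split(points, test_fraction, rng):
    """Shuffle a copy of the points and hold out a random fraction for testing."""
    shuffled = points[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)


def block_split(points, x_cut):
    """Hold out every point with x below x_cut as one spatial block."""
    train = [p for p in points if p[0] >= x_cut]
    test = [p for p in points if p[0] < x_cut]
    return train, test


rng = random.Random(42)
# Simulated occurrence points, uniform in a unit square (an assumption).
points = [(rng.random(), rng.random()) for _ in range(500)]

rand_train, rand_test = random_split(points, 0.2, rng)
block_train, block_test = block_split(points, 0.2)

rand_gap = median(nearest_train_distance(rand_test, rand_train))
block_gap = median(nearest_train_distance(block_test, block_train))
print(f"median distance to nearest training point: "
      f"random={rand_gap:.3f}, block={block_gap:.3f}")
```

The median distance from a testing point to its nearest training point is a crude summary of spatial separation. A real analysis would use several blocks sized according to the spatial structure of the data rather than one arbitrary strip.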

Spatial Cross-Validation in Species Distribution Modelling

Author: Roozbeh Valavi

This post is available in English

Species distribution modelling estimates the relationship between a set of species occurrence points and relevant environmental variables. A key step in this process is assessing how well the model predicts the locations where the species is likely to occur. This is usually done by evaluating the predictions made for a set of points that were not used in the modelling process (the testing points).

Random splitting of the species occurrence data into testing and training points

Previous studies emphasise that, for a valid assessment, the testing points should be independent of the training points, yet truly independent data are rarely available. For this reason, modellers usually split the available data into training data (for calibrating the model) and testing data (for assessing its accuracy). This strategy can also involve multiple splits (e.g. cross-validation). Because the split is usually made at random, testing points sometimes end up close to training points. The figure below illustrates this well: the testing points are red and the training points are blue. But can this cause a problem?