I’m sure by now you’ve heard of MAXENT. Have you got the impression that it’s some revolutionary new method that sits apart from classical methods like GLM? If so, I have some big news for you.
First a little background – maximum entropy modelling (MAXENT) had its origins in the 1950s, and went quiet for some time before a resurgence in the machine learning literature that has blown over into ecology in a major way. Steven Phillips et al (2006) proposed a MAXENT approach to species distribution modelling using presence-only data in a spectacularly successful paper (cited 600 times in 2012), and Bill Shipley et al (2006) proposed a MAXENT method for using traits to understand community structure in the journal Science. Ever since, there has been a flurry of work to understand MAXENT and its properties; recent Methods papers on the topic include Charles Yackulic et al (in press) on assumption-checking (or the lack thereof) and a critical review by Andy Royle et al (2012).
Pros and cons – One reason for the appeal of MAXENT is that it tends to perform well in comparisons of predictive performance; some also see the MAXENT “minimum assumption” philosophy as attractive. On the other hand, some critics argue the method is a “black box” (a property that always makes control-freaks like me a bit nervous) and point to a lack of clarity about the meaning of the quantity that is actually being modelled. Also, Phillips et al (2006) were most forthcoming in acknowledging that formal inferential and model-checking tools are not available, at least not in their software, in part because the methodology is not as well understood as classical competitors such as generalised linear models.
Now for the big news – MAXENT, as implemented in presence-only analyses, has recently been shown to be exactly mathematically equivalent to a GLM, more specifically, to a Poisson regression (also known as “log-linear modelling”). Poisson regression is a standard method from statistics that is well understood and implemented in most statistical packages, e.g. R users can type glm(formula, family="poisson"). This equivalence result is big news because a modern icon of ecological modelling has now been revealed to be equivalent to an older method that went out of fashion some time ago… it’s a bit like Beyonce taking off a face mask at the Superbowl to reveal that all along it’s actually been… Janet Jackson!
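For readers who want to see what this looks like in practice, here is a minimal sketch (not code from our paper) of fitting a presence-only model as a weighted Poisson GLM in R, in the spirit of Warton & Shepherd (2010). The data frames pres and quad, the covariates x1 and x2, and the study-region area are hypothetical placeholders.

    ## Hypothetical set-up: 'pres' holds presence locations, 'quad' a large set of
    ## background (quadrature) points covering the study region, both with the
    ## same environmental covariates x1 and x2 attached.
    dat <- rbind(cbind(pres, presence = 1),
                 cbind(quad, presence = 0))

    ## Quadrature weights: share the study-region area equally among the
    ## quadrature points; presence points get a negligible weight.
    region.area <- 1000                                   # in, say, km^2
    dat$wt <- ifelse(dat$presence == 1, 1e-6, region.area / nrow(quad))

    ## Berman-Turner-style device: response = presence/weight, fitted as a
    ## weighted Poisson GLM. The fitted values estimate intensity (expected
    ## presences per unit area), not probability of occurrence.
    dat$z <- dat$presence / dat$wt
    ppm.fit <- glm(z ~ x1 + x2, family = poisson, weights = wt, data = dat)
    summary(ppm.fit)

Nothing beyond glm() is needed here, and quadratic terms or interactions can be added exactly as in any other GLM, which is part of the point.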
Key papers – Ian Renner and I stumbled upon this MAXENT=GLM equivalence result, and discussed some of its implications in a paper in press at Biometrics. We also made a connection with point process models (PPMs), which can often be implemented as GLMs. PPMs are pretty much THE appropriate statistical framework for modelling data that arise as a set of point locations, but their application to presence-only data was only recently appreciated — as well as a paper by myself and Leah Shepherd (2010), see Avishek Chakraborty, Alan Gelfand et al (2011) and Geert Aarts et al (2012), whose Methods paper also connected to the literature on resource selection functions. Trevor Hastie and Will Fithian also noticed the MAXENT=GLM equivalence result – their pre-print on arXiv contains a number of interesting ideas.
Importance – This MAXENT=GLM equivalence result immediately changed our perspective on what MAXENT is and does – for example, in one fell swoop it dismisses all the advantages and disadvantages of MAXENT stated in this blog post (see Pros and cons). One cannot argue that MAXENT has better predictive performance than GLM when it is equivalent – any differences, e.g. those seen in the famous Elith et al paper, must be due to differences in how the methods are applied rather than to any differences in the analysis methods per se. (The main difference in how they are applied, I think, is that MAXENT by default uses a LASSO penalty, which is rarely used with GLM but seemingly highly effective.) Also, it is hard to argue philosophically that MAXENT is better because it makes minimal assumptions – it is mathematically equivalent to a method that makes all the assumptions we were trying to avoid in the first place! On the other hand, practitioners with an understanding of GLM can no longer argue that MAXENT is a black box, and the “MAXENT lacks clarity” criticism can be addressed by re-expressing it as a point process model. Finally, there is no longer a need to be concerned about any lack of inferential or model-checking tools – GLM and PPM have these in spades. While Yackulic et al (in press) bemoaned the lack of MAXENT assumption checking in the literature, practitioners no longer have much by way of excuses given the suite of model-checking tools available for GLMs and PPMs. An important outcome of our model checks to date is that presence-only data often violate the all-important independence assumption, meaning that often we have needed to use alternatives to the MAXENT model.
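And if it is the LASSO penalty you are after, that too is available for GLMs. As a hedged sketch only, continuing the hypothetical objects from the sketch above and using the glmnet package (not anything built into MAXENT):

    ## LASSO-penalised weighted Poisson GLM via glmnet (alpha = 1 gives the LASSO);
    ## the penalty strength is chosen by cross-validation.
    library(glmnet)
    X <- model.matrix(~ x1 + x2 + I(x1^2) + I(x2^2), data = dat)[, -1]
    lasso.fit <- cv.glmnet(X, dat$z, weights = dat$wt, family = "poisson", alpha = 1)
    coef(lasso.fit, s = "lambda.min")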
It is interesting to think about what will happen now that we know the two methods are one and the same – will MAXENT go out of fashion or will GLM be everyone’s best friend again? Personally, as an “old-school” statistician, I’m hoping for a return to thinking more carefully about data properties and how to specify and check models that reflect the particular properties of any given dataset. If that sounds boring, would it help if we threw in a LASSO penalty?
PS: not sure if you’ve heard of the two-day Eco-Stats symposium in Sydney, Australia, July 11-12 2013? This will include a special session on maximum entropy modelling, featuring Trevor Hastie and Bill Shipley, as well as a presence-only modelling session, featuring Jane Elith and Adrian Baddeley. Beyonce is yet to confirm. So if this blog post is up your alley then please come along!
By David Warton
Associate Editor, Methods in Ecology and Evolution
Big news! I now better understand the results of a comparison of Maxent (default settings) vs penalized logistic regression (ridge): their predictive performance was very similar (http://dx.doi.org/10.1016/j.ecolmodel.2011.04.015).
Well. Why is this too good to be true? Because the equivalence of Maxent and GLM requires that presence-only locations are randomly (or systematically) sampled. This assumption is required to be able to estimate the actual prevalence from the data; otherwise the output of Maxent is only proportional to the occurrence probability of GLMs. (Furthermore, the equivalence requires that presence-only locations are sampled without spatial or other bias. This latter issue can be addressed by modelling the sampling process explicitly. But it is not trivial.)
Casual readers of various papers (cited below) on the equivalence may think that plugging the data into Maxent will return occurrence probabilities identical to those of GLMs. That, regrettably, is virtually never the case.
The main problem is the non-random sampling. I’ll try to illustrate that with a Gedankenexperiment. Imagine you go out into your garden (or a friend’s garden, if you don’t have one) and want to find out where the slugs are. In a first round you take a random walk through your garden and record every slug you see. After 20 minutes you return home for analysis. In a second round you do the same for 60 minutes. In a third round you go down the garden path (pun intended) only.
Will the Maxent analysis of these data sets yield the same output? Of course not: with fewer points (round 1) you will get lower prevalence estimates and with biased sampling (round 3) you will get biased “ecology” for your slugs.
Will this analysis be equivalent to a GLM? That depends on how you generate the absences for both methods. So we have to augment the above Gedankenexperiment with sampling for absences. In a typical random sampling (as in, say, forest inventories) we would a priori select sites (randomly or systematically, even stratified) and sample there (presences and absences). Imagine we do that for round 1. The GLM would now estimate prevalence correctly, and the same for rounds 1 and 2 (as the proportion of sites that contain a slug). Maxent would NOT use the observed absences as absences, but rather newly generate background absences. Thus, rounds 1 and 2 would yield different estimates of prevalence with Maxent! For round 3 we’d have to make the model more complicated to allow for the bias of the garden path (which is beyond current Maxent and GLM standards).
Thus, unless the background data are sampled with the same strategy as the presences, Maxent and GLM will NOT be equivalent. Think GBIF and you will see why I am underwhelmed by this and previous papers claiming such an equivalence: it does not apply to the data people typically use Maxent for! We don’t have the foggiest idea from which sampling scheme all these “records of opportunity” were derived. Thus, we will NEVER be able to estimate prevalence correctly (with either Maxent or GLM).
Maxent-Poisson process papers state their assumptions (particularly with respect to sampling strategies) too vaguely for the typical ecologically minded reader to notice, I think. That Maxent and GLM are equivalent under very specific and typically rare conditions is statistically correct – and likely to be irrelevant for most of the analyses published in the leading macroecological journals.
The hyped blog post above and the excited tweets about Maxent/GLM remind me of the reasoning for the existence of God: it would be a disaster if he didn’t exist, so he must! Sorry, but that is not good enough.
Papers on the topic:
Dorazio RM (2012) Predicting the geographic distribution of a species from presence-only data subject to detection errors. Biometrics. doi:10.1111/j.1541-0420.2012.01779.x
Fithian W, Hastie T (2012) Statistical models for presence-only data: Finite-sample equivalence and addressing observer bias. preprint (google for it):1–30
Li W, Guo Q, Elkan C (2011) Can we model the probability of presence of species without absence data? Ecography 34:1096–1105. doi:10.1111/j.1600-0587.2011.06888.x
Royle JA, Chandler RB, Yackulic C, Nichols JD (2012) Likelihood analysis of species occurrence probability from presence-only data for modelling species distributions. Methods in Ecology and Evolution 3:545–555
Thanks to Carsten for the thought-provoking response. Unfortunately it was written on a false premise. The equivalence result does not require any of the “very specific and typically rare conditions” stated by Carsten – it does not require that presence-only locations are randomly or systematically sampled, it does not require that presence-only locations are sampled without spatial or other bias. MAXENT and a Poisson GLM are exactly equivalent for any given presence-only dataset, when the methods are used in the same way.
But people typically haven’t been using MAXENT and GLMs in the same way, a point mentioned in my original post as well as in Carsten’s response. And it is not just the choice of pseudo-absences that can differ – so can the types of response functions you use (linear, quadratic, etc.), how you standardise data, and if and how you choose a LASSO penalty. It is how the methods are applied that leads to differences in results, not the methods themselves. The equivalence result tells us that the big question here isn’t “which method do I use”, it is “how should I apply a given method”?
One might wonder where Carsten’s “conditions” came from. The suggestion that pseudo-absences should be randomly or systematically sampled across the study region is a recommendation for how to use point process models (you can even use your data to tell you how many such points you need, see Warton & Shepherd 2010). Regarding sampling or other sources of bias – as with any statistical model, you should try to account for them! No modelling approach is exempt from this rule. For example, in round 3 of the thought experiment, I would have included a term for “distance from path” in the model to try to correct for this source of sampling bias (but see Dorazio in press). Either that or I’d try to collect some better data…
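As a purely hypothetical illustration of that last point, reusing the weighted Poisson set-up sketched under the original post and assuming a column dist.to.path (distance to the garden path) had been recorded for every presence and quadrature point:

    ## Include the observer-bias covariate so that sampling effort is modelled
    ## rather than ignored (all object and column names hypothetical).
    bias.fit <- glm(z ~ x1 + x2 + log(dist.to.path + 1), family = poisson,
                    weights = wt, data = dat)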
“MAXENT and a Poisson GLM are exactly equivalent for any given presence-only dataset, when the methods are used in the same way.”
Right, so the interpretation of one’s results is entirely contingent on the nature of sampling biases that may or may not have contributed to the observed presence data, regardless of whether you used MAXENT or a GLM.
The major difference, in my opinion, is that point process GLMs have been implemented by folks with a better statistical understanding of, and adherence to, the (potentially) strict nature of the assumptions, whereas MAXENT allows anyone with a collection of presence locations to model and map a species distribution.
Like many statistical approaches, the tool is often blamed more than the tool handler. But if David’s premise is correct (equivalence to GLMs), the main “contribution” provided by MAXENT is an increased ability for people to wield tools they do not understand.
These results highlight that Maxent and GLMs are statistically equivalent and that previously stated differences are just a consequence of how these methods were applied, independently of the data used (both predictors and data quality). How, then, were these erroneous results generated in the comparative studies? Maybe the key to explaining this is the method used to determine the accuracy of the model outputs (AUC). In the absence of reliable absence information, AUC cannot inform about predictive capacity; the lack of an appropriate comparative metric may have produced this propagated misconception. When reliable absences are not available, modellers may tend to consider as “better” those methods able to generate outputs more similar to the training data. False absences in the training data act as well-predicted data when the modelling procedure is able to generate estimates similar to the training data. Maybe the previously established difference between Maxent and GLMs is simply due to a less demanding application of GLM models.
One big difference is that GLMs have generally been logistic regression, which is different to point processes. They’re closer to the pseudo-absence approach. Another is that MaxEnt doesn’t just use a linear relationship, it includes others too (quadratics, step functions etc.), so it can fit a more flexible range of relationships.
Yes Bob, GLMs have typically used logistic regression, which is different to Poisson GLMs, although as it turns out results typically do not differ by much – if you have a large number of “pseudo-absences” relative to the number of presence points, answers will be nearly identical. So I suspect this distinction is not a critical one. And yes, MAXENT isn’t stuck with linear assumptions only, but exactly the same can be said of GLMs. In fact, species distribution modellers in the past have often fitted GLMs with quadratic terms, and those not happy with the quadratic assumption have frequently used additive smoothers (such “GAMs” can be understood as glorified GLMs). This arguably is a nicer solution than MAXENT’s step functions, and seemingly fits just as flexible a range of relationships.
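As a hedged illustration only (a hypothetical data frame dat with a 0/1 presence column and covariates x1 and x2; the mgcv package is one of several ways to fit GAMs):

    ## A GLM with quadratic terms, and a GAM with smoothers, fitted to the
    ## same presence/pseudo-absence data.
    library(mgcv)
    glm.quad <- glm(presence ~ x1 + I(x1^2) + x2 + I(x2^2),
                    family = binomial, data = dat)
    gam.fit  <- gam(presence ~ s(x1) + s(x2), family = binomial, data = dat)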
From my point of view, the mathematical similarity between regression and Maxent procedures is important but not excessively relevant. The key question is the use of background or pseudo-absences in Maxent and their implications for the resulting geographic representations. Logistic regressions and Maxent are equivalent when a high number of pseudo-absences is used in a use-availability procedure (U/A). U/A only estimates the frequency of use at the presence locations, not a probability of occurrence.
Thanks to everyone for the interest and discussion. Just a couple of clarifications on recent posts, which have been moving away from the original topic towards logistic regression and the whole pseudo-absence issue.
On logistic regression and its equivalence to MAXENT – for a given presence-only dataset and a given set of pseudo-absences, MAXENT equals Poisson GLM exactly (always, as in my original post) and hence approximates logistic regression (usually). The approximation to logistic regression is a mathematical result which works well when the fitted probabilities are all small (Warton & Shepherd 2010) – it is not restricted to U/A problems, it happens pretty much any time the pseudo-absences are large in number and sensibly sampled (but see Hastie & Fithian for a variation using offsets). What is large? A common recommendation is to use 10,000 pseudo-absences*, and for most analysts that would be large enough for the logistic regression = Poisson GLM approximation to hold pretty well, since most of us have many fewer presence points than this. What is sensible? Not avoiding regions of environmental space containing presences. These “large and sensible” conditions may not cover everyone but perhaps they should.
Note also that the user has control over how they choose pseudo-absences – so the “large and sensible” conditions above are not assumptions in any statistical sense, and can’t really be argued to be unrealistic or irrelevant. On the contrary, they are applicable to any presence-only dataset. If you are curious how relevant our equivalence results are to your recent logistic regression analyses of presence-only data, change the family to poisson, and see what changes in your results. Probably not much.
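If you want to try this on your own data, a hedged sketch (a hypothetical data frame dat holding presences coded 1 and a large number of pseudo-absences coded 0, with covariates x1 and x2):

    ## Same design, two families: with many pseudo-absences and small fitted
    ## probabilities the slope estimates should be close; the intercepts differ.
    logit.fit <- glm(presence ~ x1 + x2, family = binomial, data = dat)
    pois.fit  <- glm(presence ~ x1 + x2, family = poisson,  data = dat)
    cbind(logistic = coef(logit.fit), poisson = coef(pois.fit))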
On the claim that U/A only estimates frequency not probability of occurrence (Aarts et al 2012) – yes, and it’s not just U/A analysts stuck with frequency, but everyone who analyses presence-only data that arise as a set of point events (if your presence-only data arose in the form of coarse grid cells then that’s a little different). Many U/A problems can be seen as equivalent to presence-only problems and discussed jointly (as in Aarts et al 2012). This is the topic of an upcoming special issue in the Journal of Animal Ecology.
*The “10,000 rule” is ad hoc and far from perfect – as in Warton & Shepherd 2010, I would want to keep adding pseudo-absences until my results stopped changing!
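For those who want to automate that check, a rough and entirely hypothetical sketch (assuming env is a large data frame of candidate background locations with covariates x1 and x2, and pres the presence records as before):

    ## Refit with more and more quadrature points and stop once the
    ## coefficient estimates stop changing appreciably.
    region.area <- 1000
    for (n.quad in c(1000, 5000, 20000, 80000)) {
      quad <- env[sample(nrow(env), n.quad), ]
      dat  <- rbind(cbind(pres, presence = 1), cbind(quad, presence = 0))
      dat$wt <- ifelse(dat$presence == 1, 1e-6, region.area / n.quad)
      dat$z  <- dat$presence / dat$wt
      fit <- glm(z ~ x1 + x2, family = poisson, weights = wt, data = dat)
      print(round(coef(fit), 3))
    }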
…and so,
suddenly all the parsimonious GLM people are into spatial & temporal predictions, and not using AIC anymore? Really? And do they really try to boost predictions into unsampled areas?
Simply see the huge limitations of GLMs, and then move into Leo Breiman and the tree world: complex and deep there. Together with TRUE OPEN ACCESS and OPEN SOURCE for a better science-based management, that would be progress indeed.
Instead, the “Maxent similar to GLM” discussion does not mean much at all. It entertains just a few people with a specific mindset and goals (a limited and conservative Gaussian & Fisherian worldview). What about the missed chances of the last decades?
Just keeping the focus on a few independent predictors and on linear models does not help much, nor is it often required or even useful!
That is widely shown all over, and in the real world, and for real progress.
Funny to see how long the conservative ‘linear statistics’ ivory tower and its lobby tries to hang on, and how it is argued in Britain, parts of the U.S. and elsewhere (professional societies).
C.F. Gauss died centuries ago already; did anybody notice?
Yours kindly
Falk Huettmann PhD, Associate Professor Uni of Alaska-Fairbanks
PS. Ever tried to model multi-species vs climate layers, human variables and topographic ones with a GLM? And who in the DNA world still uses GLMs for data mining? It shows the folly of this discussion. In virtually all ensemble models, GLMs tend to perform very poorly (see also Elith et al. 2006 vs Maxent). Why would that be? And where was the GLM community then?
It seems Falk and I agree that stats as a discipline has moved on considerably in the last couple of decades, and that there are often (but not always) better things one can do than to fit a pure vanilla generalised linear model. But the point of the original post was that MAXENT is exactly a GLM, so any performance advantages (such as those in the Elith et al paper we both referred to) are not due to some magical advantage of maximum entropy modelling. The differences come down to the way GLM/MAXENT is implemented – perhaps even due to some of the advances over the last couple of decades Falk refers to (e.g. LASSO).
“GLM” has been interpreted narrowly by Falk – to me it is a very flexible framework that has over the last few decades been extended to handle spatial and temporal effects, regularisation (e.g. LASSO), point patterns, non-linearity (additive or not), and plenty more. Trees, mentioned by Falk above, can be helpfully understood as a GLM extension. So perhaps Falk is in our ivory tower too – at the very least he should drop in to say hi!
I have followed the development of this blog post over the last months with considerable interest, eagerly awaiting the publication of my own contribution to the opening of the MaxEnt black box. The black-box-like nature of the best-performing and most popular distribution modelling (DM) method (MaxEnt) has puzzled me for many years. Independently of Bill Shipley and David Warton and others, I set out four years ago to try to find out if the MaxEnt method could be explained by maximum likelihood (ML) principles. Not surprisingly, given the ‘MAXENT=GLM’ equivalence now proved by Renner and Warton, my attempts were successful. In the recently published open-access monograph ‘A strict maximum likelihood explanation of MaxEnt, and some implications for distribution modelling’ (Sommerfeltia 36: 1-132; downloadable from http://www.degruyter.com/view/j/som.2013.36.issue-1/v10208-011-0016-2/v10208-011-0016-2.xml?format=INT), I provide the mathematical and statistical details. Briefly stated, I show that MaxEnt is an ML method which treats uninformed background points as pseudo-absences and maximises the likelihood for presence points. Given that certain assumptions are met, this opens the way for use of standard tools for model comparison and assessment, such as the likelihood-ratio and F-ratio tests for nested models, the AIC, BIC etc., with MaxEnt. The implications of these findings are discussed (see http://www.nhm.uio.no/forskning/grupper/geco/ for abstract and more information).
The comments to the ‘big news’ posted over the last months really surprise me. While distribution modelling has been criticised for its tenuous rooting in ecological theory [despite the fact that it can be well understood from a gradient analytic perspective; see Halvorsen (2012) in Sommerfeltia 35: 1-165; downloadable from http://www.degruyter.com/view/j/som.2012.35.issue-1/v10208-011-0015-3/v10208-011-0015-3.xml?format=INT] and the extensive use of black boxes, findings that shed light into the corners of the black boxes and open opportunities for significant methodological improvements are dismissed as ‘hyped’ or ‘not meaning much at all’! A result of tenuous rooting in theory? Pointing out some of the important prospects offered by the new insights is certainly needed. Firstly, knowing what a method does is in itself important. Such knowledge is needed to check if models are adequately specified, if choices of settings and options are likely to be optimal, to choose the right level of model complexity, and to understand when and why results cannot be trusted (just to mention a few reasons). Secondly, MaxEnt (or GLM or PPM) is not just one method, it is a tool template which can be crafted in a multitude of ways, into tools that suit different data sets and different purposes! Why use a putter for a 300-m drive? The user’s choice of the way he or she builds the models is clearly decisive for the modelling outcome, and different specifications explain why MaxEnt and GLM perform differently in comparative tests.
The new insights offer alternatives in many respects to the ‘standard MaxEnt practice’ from which few users depart – the set of default options and settings in the user-friendly Maxent software developed by Steven Phillips and co-workers – and which has been challenged in several recent papers (e.g. Yackulic et al. in Meth. Ecol. Evol. 4: 236-243). In the Sommerfeltia 36 paper I show, by theoretical considerations and examples, that several settings and options should be carefully evaluated against alternatives that are now available. I suggest five main additions or amendments to the ‘standard MaxEnt practice’: (1) development of flexible, interactive tools to assist the derivation of variables from raw explanatory variables; (2) development of interactive tools that allow the user freely to combine model selection methods, methods and approaches for internal model performance assessment, and model improvement criteria into a data-driven modelling procedure; (3) integration of independent presence/absence data into the modelling process, for external model performance assessment, for model calibration, and for model evaluation; (4) new output formats, notably a probability-ratio output format which directly expresses the ‘relative suitability of one place vs. another’ for the modelled target; and (5) development of options for discriminative use of MaxEnt, i.e., use of MaxEnt with presence/absence data. Tools for (1–4) are currently being developed by S. Mazzoni together with colleagues at the GEco research group at the Natural History Museum, University of Oslo. I conclude that the currently most important MaxEnt-related research needs are: (1) comparative studies of strategies for construction of parsimonious sets of derived variables for use in MaxEnt modelling; and (2) comparative tests, on independent presence/absence data, of the predictive performance of MaxEnt models obtained with different model selection strategies, different approaches for internal model performance assessment, and different model improvement criteria.
Hopefully, the new insights into the formerly black box of MaxEnt will lead to improved distribution models!
I’m grateful to Dave Warton for opening up this line of discussion, and to others for progressing the discussion above. Sorry it took me so long to discover it.
I am struck throughout the discussion above that whilst the background-pseudo-absence presence-only myth got partially debunked, no one has raised an objection to the use of pseudo-absences in modelling. I know the person who first coined the term, and it was intended as a suitably pejorative term. Somehow in the discourse it morphed into a translucent form that stopped attracting the negative perception it deserves. How did that happen? I am concerned that the modelling logic that depends on the use of pseudo-absences is circular. There has been a proliferation of papers that have sought to ‘predict’ species distributions in one form or another that have claimed to have been *validated* using non-independent data, evaluated using presence and pseudo-absence data. I’m at a loss to think how we could make the science framework any sloppier. We would be rightly appalled if people manufactured presence data. Why are we so complacent and accepting of them manufacturing absence data out of their imagination? Even if we sugar-coat the process in euphemistic terms such as “specifying the background” and leaving the dirty work to an algorithm to select sample points, it amounts to the same thing. Why aren’t we scathingly critical when modellers add insult to injury and take that same set of pseudo-data and use it to test their model, as well as to train it?
Personally, I don’t buy the argument that because the (discriminative) modelling methods demand absence data and there’s no other way to generate the inputs, we should just relax our principles of good science to accommodate them. There are envelope and other methods that don’t rely on such data. In niche or distribution modelling, pseudo-absence data reflect our (frequently rubbish) *assumptions* about where species cannot persist. The effect of the circular logic is to act as an echo chamber, strongly reinforcing our assumptions. How much incentive is there for a modeller to question those assumptions when the model comes back and indicates that, using those same assumptions as *inputs* into Kappa, TSS, AUC, etc., the assumptions were correct? What am I missing here? Surely a good model framework tests assumptions through falsification rather than blindly reinforcing them.
I’ve been struck previously by the phenomenon that Dan Linden observed above, that MaxEnt, more than any other SDM, has managed to attract a segment of users with little appropriate knowledge or experience, and who have managed to produce and publish some real clangers. I am dismayed that so many of these models did not come under greater scrutiny during review. It is one thing for an author to misuse a model and write a paper presenting the results, but it is another matter entirely for two reviewers, a subject editor and an EIC to all fail to scrutinise the results for ecological reasonability. Perhaps the pace of publication has outstripped our resources to properly scrutinise papers. Please note, I’m not commenting on MaxEnt here or its ability to generate useful models, but rather gently reinforcing Dan’s comment about the importance of the user being appropriately trained for the tool they are wielding. It is clearly insufficient to simply train the user how to get the system to generate a map. That process can be beguilingly simple. Underlying it is a complex natural system and an analytical algorithm that can give wacky results if the wrong assumptions and options are selected. “Default” doesn’t mean “satisfactory in all (or even most) circumstances”.