
# Appearances can be deceiving

Anabel Blasco, BSc in Statistical Techniques and MSc in Statistics and Operations Research (Universitat Politècnica de Catalunya), and MSc in Mathematics for Finance (Universitat Autònoma de Barcelona), works as a statistical consultant and training area coordinator at the Servei d'Estadística Aplicada of the Universitat Autònoma de Barcelona.

I’m a statistical consultant. In the course of my job, I have advised many applied researchers, from botanists to andrologists, and performed many different statistical analyses, from a simple t-test to more sophisticated analyses that call for advanced statistical modelling. In order to evaluate the needs of researchers, I find it necessary to meet them and let them explain the study goal, show the available data and detail their statistical doubts. After the meeting, I usually know what kind of analysis is required.

At this point, I want to stress that we should not underestimate any study, whatever it may look like at first sight; doing so is a serious mistake. Let me explain.

As a statistical researcher, I like working with data that test my analytical abilities while I try to get the most out of them. However, a high-level analysis is not always required; sometimes the simplest analysis satisfies the researcher's needs and expectations. And occasionally, seemingly harmless data conceal the need for a sophisticated statistical analysis that initially went unnoticed.

Some months ago, I had a meeting with two biologists. Their study dealt with predation of a certain type of plant by insects in different regions. They had tried a simple ANOVA to compare the number of plants affected by predation among regions, but the test did not give statistically significant results. A statistician quickly realizes what is wrong: “Maybe you are not taking into account the variability among regions and, of course, you don’t have normal data because you are dealing with counts”.

Homogeneity of variances and normality are two important assumptions of the ANOVA test. To solve the problem of non-constant variances, different alternatives are possible, for example using transformations. The most common data transformations are those proposed by Anscombe (1944) and the Box-Cox transformations (1964). These transformations not only address the heterogeneity of variances, but also reduce data anomalies such as non-additivity and non-normality. Transforming the data is a good solution, but we can go even further. In 1972 John Nelder and Robert Wedderburn formulated the generalized linear model (GLM), a flexible generalization of the linear regression model that allows response variables with distributions other than the normal.
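To make the two transformations concrete, here is a minimal sketch in plain Python. It assumes the usual variance-stabilising form of Anscombe's transformation for counts, 2·sqrt(x + 3/8), and the standard Box-Cox power family. The counts are made up for illustration, and the shift of 1 applied before Box-Cox (which cannot handle zeros) is my own assumption, not something taken from the biologists' study.

```python
import math


def anscombe(count):
    """Anscombe-type variance-stabilising transformation for Poisson-like counts."""
    return 2.0 * math.sqrt(count + 3.0 / 8.0)


def box_cox(value, lam):
    """Box-Cox power transformation; only defined for strictly positive values."""
    if value <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return math.log(value)  # limiting case of the power family
    return (value ** lam - 1.0) / lam


# Hypothetical counts of affected plants in one region (illustrative only).
counts = [0, 2, 5, 1, 9, 0, 3]

anscombe_values = [anscombe(c) for c in counts]
# Zeros are not allowed in Box-Cox, so the counts are shifted by 1 (an assumption).
box_cox_values = [box_cox(c + 1, 0.5) for c in counts]
```

In practice the Box-Cox parameter would be estimated from the data rather than fixed at 0.5 as it is here.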

Since we were evaluating counts, a GLM with a Poisson distribution could be applied. The result remained the same: no statistically significant differences in predation counts among regions. We started with ANOVA, then transformed the data to obtain variables with theoretically nice properties, then estimated a GLM with a Poisson distribution and, in the end, we were at the same point. There was something wrong. In fact, there was a subtle difference among regions: one of them had many more zero counts than the others. These zeros could be treated in a more appropriate way.
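The situation can be mimicked with a small sketch. For a Poisson model with region as the only factor, the fitted mean of each region is simply its sample mean, so the likelihood-ratio statistic against a common mean can be computed by hand. The region names and counts below are invented for illustration (they are not the biologists' data), and `zero_check` is a hypothetical helper that compares the observed share of zeros with the share a Poisson with the same mean would expect.

```python
import math


def poisson_loglik(counts, mean):
    """Log-likelihood of independent Poisson counts sharing one mean."""
    if mean == 0:
        return 0.0 if all(c == 0 for c in counts) else float("-inf")
    return sum(c * math.log(mean) - mean - math.lgamma(c + 1) for c in counts)


def region_lr_statistic(groups):
    """Likelihood-ratio statistic: one mean per region versus one common mean."""
    pooled = [c for counts in groups.values() for c in counts]
    ll_null = poisson_loglik(pooled, sum(pooled) / len(pooled))
    ll_full = sum(poisson_loglik(c, sum(c) / len(c)) for c in groups.values())
    # Returned with its degrees of freedom (number of regions minus one).
    return 2.0 * (ll_full - ll_null), len(groups) - 1


def zero_check(counts):
    """Observed share of zeros and the share expected under a Poisson fit."""
    mean = sum(counts) / len(counts)
    observed = sum(1 for c in counts if c == 0) / len(counts)
    return observed, math.exp(-mean)


# Invented counts: similar means, but one region is dominated by zeros.
regions = {
    "north": [3, 4, 2, 5, 3, 4, 3, 4],
    "south": [4, 2, 5, 3, 4, 3, 2, 5],
    "east": [0, 0, 0, 0, 0, 8, 10, 9],
}

statistic, df = region_lr_statistic(regions)
zero_shares = {name: zero_check(c) for name, c in regions.items()}
```

With these invented numbers the regional means are almost identical, so the statistic is tiny, yet the "east" region shows far more zeros than a Poisson with its mean would produce. That is the kind of pattern a comparison of means cannot see.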

The answer to this problem appeared in the nineties: zero-inflated Poisson models. These models are a way of dealing with overdispersion. The model assumes that the data are a “mixture” of two sorts of individuals: one group whose counts are generated by a standard Poisson regression model, and another group whose counts are always zero. Thus, this approach can take into account the excess of zero counts. A zero-inflated Poisson model (ZIP) was therefore a natural candidate to solve our problem. Moreover, in this setting, not only a Poisson but also a Negative Binomial distribution can be assumed (ZINB). This led me to further investigation, comparing ZIP and ZINB models with GLMs with Poisson and NB distributions by using appropriate tests. The choice between models can be made not only on statistical grounds but also on the basis of biological interpretation. In this case, we saw that a ZINB model could describe not only the count process for predation but also the process generating the zero counts.
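As an illustration of the mixture idea, the following sketch fits the simplest possible ZIP (no covariates, a single mixing probability `pi` and a single Poisson mean `lam`) with a basic EM algorithm and compares it with a plain Poisson fit through AIC. It is a simplification of the analysis described above, not the analysis itself: the counts are invented, region effects and the Negative Binomial variants are left out, and AIC stands in for the "appropriate tests", which I have not specified.

```python
import math


def poisson_logpmf(y, lam):
    """Log of the Poisson probability of observing y with mean lam."""
    return y * math.log(lam) - lam - math.lgamma(y + 1)


def zip_loglik(counts, pi, lam):
    """ZIP log-likelihood: a zero is structural with probability pi."""
    total = 0.0
    for y in counts:
        if y == 0:
            total += math.log(pi + (1.0 - pi) * math.exp(-lam))
        else:
            total += math.log(1.0 - pi) + poisson_logpmf(y, lam)
    return total


def fit_zip(counts, iterations=500):
    """Fit an intercept-only ZIP by EM and return (pi, lam)."""
    n, total = len(counts), sum(counts)
    zeros = sum(1 for y in counts if y == 0)
    if total == 0:
        raise ValueError("all counts are zero; the Poisson mean is not estimable")
    pi, lam = zeros / (2.0 * n), total / (n - zeros)  # crude starting values
    for _ in range(iterations):
        # E-step: probability that an observed zero is a structural zero.
        w = pi / (pi + (1.0 - pi) * math.exp(-lam))
        structural = zeros * w
        # M-step: update the mixing probability and the Poisson mean.
        pi = structural / n
        lam = total / (n - structural)
    return pi, lam


def aic(loglik, n_params):
    """Akaike information criterion (smaller is better)."""
    return 2.0 * n_params - 2.0 * loglik


# Invented counts with an excess of zeros (illustrative only).
counts = [0] * 12 + [3, 4, 5, 2, 6, 4, 3, 5]

mean = sum(counts) / len(counts)
aic_poisson = aic(sum(poisson_logpmf(y, mean) for y in counts), 1)

pi_hat, lam_hat = fit_zip(counts)
aic_zip = aic(zip_loglik(counts, pi_hat, lam_hat), 2)
```

For data like these, the ZIP fit separates the extra zeros from the counting process and obtains a clearly lower AIC than the plain Poisson. The same comparison could be extended to NB and ZINB fits using a statistics package.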

The lesson of this story is that an apparently simple study can sometimes hide the most sophisticated analysis. Never underestimate the difficulty of a simple experiment, because appearances can often (very often) be deceiving.

By Anabel Blasco

Statistical Consultant