B for Biology: Not just counting sheep

As some of my co-bloggers have mentioned before, Biostatistics has lately been closely associated with studies in the health sciences and has somewhat forgotten the wider biological side of things. Today I will focus on ecological and environmental matters and the statistical approach to these kinds of problems.

According to Smith (Ecological Statistics; Encyclopedia of Environmetrics, Vol. 2, pp 589-602; John Wiley & Sons, 2002), ecological Statistics can be defined as the area of Statistics that focuses on ecological problem solving, where Ecology is understood as the scientific study of the distribution and abundance of organisms. It therefore covers “sampling, assessment, and decision making for both policy and research” (Patil, G.P., Environmental and Ecological Statistics; Encyclopedia of Environmetrics, Vol. 2, pp 672-674; John Wiley & Sons, 2002), and requires advanced techniques to correctly model complex univariate and multivariate relationships (often nonlinear) from both spatial and temporal perspectives.

To fully understand this field of study, we first need to make a clear distinction between single-species and multispecies analysis, two very different approaches that call for different statistical strategies.

The former is mainly based on measurements of a species' abundance and performance (survival, growth, and recruitment). As such, it faces an old dilemma: how to keep observational bias to a minimum? Petersen (mark-recapture) and transect methods are used in wildlife censuses to limit this and other biases, and advanced methodology such as mixed models, flexible regression techniques, spatial and temporal statistics, and Bayesian inference is applied in the analysis itself.
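As a minimal sketch of the Petersen (mark-recapture) idea: animals are marked in a first sample, and the share of marked animals found in a second sample is used to scale up to the population size. The function names and the numbers are illustrative assumptions, and the Chapman variant is included as a commonly used small-sample correction rather than something prescribed here.

```python
def petersen_estimate(marked_first, caught_second, recaptured):
    """Lincoln-Petersen estimate of population size: N = M * C / R."""
    if recaptured <= 0:
        raise ValueError("at least one recapture is needed")
    return marked_first * caught_second / recaptured


def chapman_estimate(marked_first, caught_second, recaptured):
    """Chapman's variant, less biased when recaptures are few."""
    return (marked_first + 1) * (caught_second + 1) / (recaptured + 1) - 1


# Hypothetical census: 100 animals marked, 80 caught later, 20 of them marked.
print(petersen_estimate(100, 80, 20))  # 400.0
print(chapman_estimate(100, 80, 20))
```

Both estimators assume a closed population and equal catchability, which is exactly where observational bias can creep in.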

Multispecies analysis, on the other hand, deals with the complicated interactions and dependencies that exist within ecosystems. Its main pillars are the notions of diversity and integrity. Diversity measures summarise global changes across the different species of a community, and are mostly criticized for the potential lack of ecological relevance of some of the measures. Integrity metrics account for an ecosystem's unimpaired state (read this interesting article for further discussion on the difference between health and integrity). Multivariate analysis of ecosystems includes methods like correspondence analysis and redundancy analysis, amongst many others.
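To make the notion of a diversity measure concrete, here is a small sketch of the Shannon index, H = -sum(p_i * ln p_i), together with Pielou's evenness, H / ln(S). The text does not single out any particular index, so this choice and the species counts are assumptions for illustration only.

```python
import math


def shannon_index(counts):
    """Shannon diversity H = -sum(p_i * ln p_i) over species with count > 0."""
    total = sum(counts)
    props = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in props)


def pielou_evenness(counts):
    """Shannon index divided by its maximum, ln(number of species present)."""
    species = sum(1 for c in counts if c > 0)
    if species < 2:
        return 0.0
    return shannon_index(counts) / math.log(species)


# Hypothetical abundances of four species in two communities.
even_community = [10, 10, 10, 10]
dominated_community = [37, 1, 1, 1]
print(shannon_index(even_community), pielou_evenness(even_community))
print(shannon_index(dominated_community), pielou_evenness(dominated_community))
```

Two communities with the same number of species can score very differently, which is part of why the ecological relevance of any single index is debated.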

It is worth noting that the latter is nowadays a major focus of research, as a direct consequence of increasing public awareness of the need to preserve endangered ecosystems in order to ensure the good health of the whole planet.

In the end, it is not just about counting sheep but about how to count them, ensuring the sample is representative, and considering issues like diversity and integrity in their relationships with other species.

Note: as proof of the importance of this broad area in its own right, there is a multidisciplinary journal dedicated to the topic, “Environmental and Ecological Statistics”, and an exhaustive R Task View called “Analysis of Ecological and Environmental Data” available here. “Analysing Ecological Data” (2007) by Zuur, Ieno & Smith is also highly recommended.

Have you faced any of these problems? Any tips? Many thanks for your comments!!

Biostatistics software review

Nowadays, most of us would not be able to do our daily job without software. It is therefore essential to choose the right one because, whether we like it or not, it will become our (sometimes hated, mostly loved) closest companion.

Driven by the fast development of technology and by the need to answer ever more complex biomedical problems, several software manufacturers have produced statistical packages oriented to different fields of Statistics.

In this post we give an overview of some of the software available and in use in biostatistical research, classified into three main categories: general use, specialized, and tailored alternatives.

General use

S-Plus/R

S-Plus and R are both statistical software and programming environments. They allow customized data analysis to be coded in a high-level programming language. R and S-Plus are quite close, since both are dialects of the S language, and consequently much code written for one can be run on the other with little or no change. The main difference between the two is that R is free software released under a GNU licence, so it can be freely accessed and adapted to suit each researcher's data analysis requirements.

Among the multiple R user-friendly interfaces available, we would highlight the following:

  • RStudio is a free and open source integrated development environment for R that can run under Windows, Linux, Mac or even over the web using RStudio Server. As a special feature, it is organized into four different work areas: the console for interactive R sessions, a tabbed source-code editor to organize a project’s files, another frame with the workspace and a history of the commands you have previously entered, and finally a frame that provides an easy administrative tool for managing packages, files, plots and help.
  • R Commander's main advantage is that it needs no separate application: you can access it simply by loading the package Rcmdr from your R console, and it allows both menu-based selection and coding. However, the choices available for selection are somewhat limited.
  • RKWard aims to make R programming easier and faster by providing a graphical front-end that can be used by inexperienced R users as well as experts. Like RStudio, it runs under Windows, Linux and Mac. Unlike R Commander, it cannot be loaded from within an R session and has to be started as a stand-alone application.
  • Deducer is another graphical user interface (GUI) for R that avoids the hassle of programming. Amongst its outstanding features, we would highlight its plot builder tool with multiple customisation options.

As a particular application of R it is worth mentioning one widely used in the analysis of genomic data:

  • Bioconductor, with more than 600 R packages, is focused on the analysis of high throughput genomic data including analysis of microarrays and dealing with sequence data or variant files such as those generated by Next Generation Sequencing projects.

SAS

Statistical Analysis System (SAS) is an integrated software package that allows users to program tasks such as statistical analyses, reporting of results, operational research studies and quality improvement. Though it is oriented mostly to business and insurance enterprises, SAS has become an important tool in biomedical research in recent years. It is worth pointing out that its language is based on PL/I.

IBM SPSS

Although mainly used in the Social Sciences field, this software is often chosen by professionals in the area of Biomedicine for its ease of use and attractive graphics.

STATA

STATA (Statistics+data) is another well-known package for data analysis. It was created in 1985 by StataCorp and is used mostly in business and epidemiology research. For details of the current version, go here.

The statistical packages mentioned above are the most widely used in our field. Often, however, the analysis at hand calls for more specific software. Some programs that might fit these needs are detailed below.

Specialized

WinBUGS

WinBUGS is statistical software for the Bayesian analysis of complex probability models using Markov chain Monte Carlo (MCMC) methods. It is part of the BUGS (Bayesian inference Using Gibbs Sampling) project. It was created to run under Microsoft Windows as an independent program, but it can also be accessed from R through the package R2WinBUGS.

There is an open-source version of WinBUGS called OpenBUGS, which can be called from R (with the package R2OpenBUGS) and SAS, amongst others. Another open-source alternative to WinBUGS is JAGS (Just Another Gibbs Sampler), which can be accessed from R via R2jags or rjags.

MLwiN

MLwiN is an important package for fitting multilevel models, developed at the University of Bristol. Its main feature is an equation window where one can write the model with the parameters to be estimated.

Mplus

The general modelling approach of Mplus is to describe the collected data by means of latent variables and path diagrams. Thus, the statistical techniques mainly used are exploratory and confirmatory factor analysis, path analysis, and  hierarchical models.

Tailored alternatives

  • EpiLinux is an operating system especially orientated towards those professionals, researchers and students working in the areas of Epidemiology, Biostatistics, and health studies in general. EpiLinux 3 is based on GNU/Linux Ubuntu 12.04 LTS with Lightweight X11 Desktop Environment (LXDE) and is a joint project of the Dirección Xeral de Innovación e Xestión da Saúde Pública de la Xunta de Galicia and the Biostatistics Unit of the Universidad de Santiago de Compostela. For further information and download, visit the following website.
  • BioStatFLOSS, similar to EpiLinux but restricted in this case to the Windows operating system, gathers programs specifically designed for epidemiologic, biostatistical and health studies in general. Its major advantage is that no installation is required. You can download it here.
  • Epidat is a free user-friendly programme developed by the Servizo de Epidemioloxía de la Dirección Xeral de Innovación e Xestión da Saúde Pública de la Consellería de Sanidade (Xunta de Galicia) with the institutional support of the Organización Panamericana de la Salud (OPS-OMS) and purposefully built for the analysis of epidemiologic data. More information can be found here.

All these tools will definitely make your life as a  biostatistician so much easier, but now it is your choice!! You could even keep on doing your number crunching by hand 🙂

We would love to hear about your experience with software in Biostatistics. Please leave your answers in the poll below. Thank you!

Invitation to the XIV Spanish Biometric Conference 2013

Elvira Delgado Márquez, MSc in Applied Statistics, BSc in Computer Engineering and BSc in Statistics (University of Granada) is a PhD student at the University of Castilla-La Mancha where she works with Professor López Fidalgo and Dr. Amo Salas. Her area of expertise is Optimum Experimental Designs. Contact Elvira

The term “Biometry” has been used to refer to the field of development of statistical and mathematical methods applicable to data analysis problems in the biological sciences. Statistical methods for the analysis of data from agricultural field experiments to compare the yields of different varieties of wheat, for the analysis of data from human clinical trials evaluating the relative effectiveness of competing therapies for disease, or for the analysis of data from environmental studies on the effects of air or water pollution on the appearance of human disease in a region or country are all examples of problems that would fall under the umbrella of “Biometrics” as the term has been historically used.

Recently, the term “Biometrics” has also been used to refer to the emerging field of technology devoted to identification of individuals using biological traits, such as those based on retinal or iris scanning, fingerprints, or face recognition. Neither the journal “Biometrics” nor the International Biometric Society are engaged in research, marketing, or reporting related to this technology. Likewise, the editors and staff of the journal are not knowledgeable in this area.

On behalf of the Spanish Biometric Society, the area of Statistics and Operations Research at the University of Castilla – La Mancha welcomes the celebration of the XIV Spanish Biometric Conference – 2013 that will be held in Ciudad Real (Spain), from the 22nd to the 24th of May, 2013.

Full information can be found at the Conference's website, or by writing to the following e-mail address: biometria2013@gmail.com

We invite scholars willing to promote the development and application of mathematical and statistical methods in the areas of Biology, Medicine, Psychology, Pharmacology, Agriculture, Bioinformatics and other areas related to life sciences, to come to Ciudad Real and participate in the presentation of the latest results in these areas.

Furthermore, the Biometrical Journal (edited in cooperation with the German and the Austro-Swiss Region of the International Biometric Society), indexed in Journal Citation Reports (JCR), will publish a special issue with a selection of the papers presented at the conference.

We remain at your disposal and look forward to welcoming you to Ciudad Real very soon.

Elvira Delgado on behalf of the organizing committee.

Appearances can be deceiving

Anabel Blasco, BSc in Statistical Techniques and MSc in Statistics and Operations Research (Universitat Politècnica de Catalunya), and MSc in Mathematics for Finance (Universitat Autònoma de Barcelona), works as a statistical consultant and training area coordinator at the Servei d’Estadística Aplicada of the Universitat Autònoma de Barcelona. Contact Anabel

I’m a statistical consultant. In the course of my job, I have advised many applied researchers, from botanists to andrologists, and performed many different statistical analyses, from a simple t-test to more sophisticated ones that call for advanced statistical modelling. In order to evaluate the needs of researchers, I find it necessary to meet them and let them explain the study goal, show the available data and detail their statistical doubts. After the meeting, I usually know what kind of analysis is required.

At this point, I think we should not underestimate any study, whatever it may seem at first sight; doing so is a serious mistake. Let me explain.

As a statistical researcher, I like to work with data that test my analytical abilities while I try to get the most out of them. However, a high-level analysis is not always required; sometimes the simplest analysis satisfies the researcher's needs and expectations. And sometimes seemingly harmless data conceal the need for a sophisticated statistical analysis that had initially gone unnoticed.

Some months ago, I had a meeting with two biologists. Their study dealt with predation of a certain type of plant by some insects in different regions. They had tried to use a simple ANOVA test to compare the number of plants affected by predation among regions, but the test did not give statistically significant results. A statistician quickly realizes what is wrong: “Maybe you are not taking into account the variability among regions and, of course, you don’t have normal data because you are dealing with counts”.

Homogeneity of variances and normality are two important assumptions of the ANOVA test. To solve the problem of non-constant variances, different alternatives are possible, for example using transformations. The most common data transformations are those proposed by Anscombe (1944) and the Box-Cox transformations (1964). These transformations not only address non-homogeneity, but also reduce data anomalies such as non-additivity and non-normality. Transforming the data is a good solution, but we can go even further. In 1972 John Nelder and Robert Wedderburn formulated the generalized linear model (GLM), a flexible generalization of the linear regression model that allows for response variables with distributions other than the normal.
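As a rough illustration of the two families of transformations, the sketch below uses the variance-stabilising form usually attributed to Anscombe for Poisson counts, 2 * sqrt(x + 3/8), and the one-parameter Box-Cox transformation. The function names, the example counts and the offset of 1 added before the Box-Cox step are assumptions made for illustration; they are not taken from the study described here.

```python
import math


def anscombe(count):
    """Variance-stabilising transform for Poisson-like counts."""
    return 2.0 * math.sqrt(count + 3.0 / 8.0)


def box_cox(value, lam):
    """One-parameter Box-Cox transform; defined only for positive values."""
    if value <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return math.log(value)
    return (value ** lam - 1.0) / lam


# Hypothetical predation counts; zeros are shifted by 1 before Box-Cox
# because the transform is undefined at zero.
counts = [0, 2, 3, 7, 12]
print([round(anscombe(c), 3) for c in counts])
print([round(box_cox(c + 1, 0.5), 3) for c in counts])
```

In practice the Box-Cox parameter is estimated from the data rather than fixed in advance, which this sketch does not attempt.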

Since we are evaluating counts, a GLM with a Poisson distribution could be applied. The result remained the same: no statistically significant differences in predation counts among regions. We started with ANOVA, then transformed the data to obtain variables with theoretically nice properties, then fitted a Poisson GLM and, in the end, we were at the same point. There was something wrong. In fact, there was a subtle difference among regions: one of them had many more zero counts than the others. These zeros could be treated in a more appropriate way.
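One quick way to spot this kind of problem is to compare the observed share of zeros with the share a Poisson distribution with the same mean would predict, exp(-mean), and to look at the variance-to-mean ratio. The sketch below is only a descriptive check with made-up counts for a single hypothetical region, not the analysis carried out in the study.

```python
import math


def dispersion_index(counts):
    """Sample variance divided by the mean; close to 1 for Poisson data."""
    n = len(counts)
    mean = sum(counts) / n
    variance = sum((c - mean) ** 2 for c in counts) / (n - 1)
    return variance / mean


def zero_fractions(counts):
    """Observed fraction of zeros and the fraction expected under Poisson."""
    mean = sum(counts) / len(counts)
    observed = sum(1 for c in counts if c == 0) / len(counts)
    expected = math.exp(-mean)
    return observed, expected


# Hypothetical counts of predated plants in one region.
region = [0, 0, 0, 0, 0, 0, 3, 5, 4, 6]
print(dispersion_index(region))
print(zero_fractions(region))
```

A dispersion index well above 1 together with far more zeros than expected suggests that a plain Poisson model is not adequate.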

The answer to this problem appeared in the nineties: zero-inflated Poisson models. These models are a way of dealing with the overdispersion caused by excess zeros. The model assumes that the data are a “mixture” of two sorts of individuals: one group whose counts are generated by a standard Poisson regression model, and another group that produces only zeros. Thus, this approach can take into account the excess of zero counts. A zero-inflated Poisson (ZIP) model was therefore a natural candidate to solve our problem. Moreover, in this setting a Negative Binomial distribution can be assumed instead of the Poisson (ZINB). This led me to further investigation, comparing ZIP and ZINB models with Poisson and NB GLMs by using appropriate tests. The choice between one model and another can be made not only from a statistical point of view but also on the basis of biological interpretation. In this case, we saw that a ZINB model could describe not only the count process for predation but also the process generating zero predation.
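The mixture idea can be written down directly. In the sketch below, pi is the probability of belonging to the always-zero group and lam is the Poisson mean of the other group, so that P(0) = pi + (1 - pi) * exp(-lam) and P(k) = (1 - pi) * Poisson(k; lam) for k > 0. The parameter values and counts are invented for illustration, and no covariates or model fitting are included; a real analysis would estimate the parameters with dedicated software.

```python
import math


def zip_pmf(k, lam, pi):
    """Probability of count k under a zero-inflated Poisson model."""
    poisson = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    if k == 0:
        return pi + (1.0 - pi) * poisson
    return (1.0 - pi) * poisson


def zip_loglik(counts, lam, pi):
    """Log-likelihood of the counts; pi = 0 gives the plain Poisson model."""
    return sum(math.log(zip_pmf(k, lam, pi)) for k in counts)


# Hypothetical counts with many zeros.
counts = [0, 0, 0, 0, 0, 0, 3, 5, 4, 6]
print(zip_loglik(counts, lam=1.8, pi=0.0))  # plain Poisson at the sample mean
print(zip_loglik(counts, lam=4.5, pi=0.6))  # zero-inflated alternative
```

For these made-up data the zero-inflated version has the higher log-likelihood, which is the kind of evidence a formal model comparison would then weigh against the extra parameter.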

The lesson of this story is that sometimes an apparently simple study can hide the most sophisticated analysis. Never underestimate the difficulty of a simple experiment, because appearances can often (very often) be deceiving.

By Anabel Blasco

Statistical Consultant

Servei d’Estadística Aplicada

Interview with…Isabel Martínez Silva

Isabel Martínez Silva is a researcher, statistical consultant and PhD candidate at the Biostatistics Unit of the University of Santiago de Compostela. Contact Isabel

1. Why do you like Biostatistics?

I find Biostatistics particularly interesting in the sense that not only does it allow you to learn about Statistics, but it also gives you the opportunity to cooperate with professionals who need our statistical knowledge for their research in the Bio sciences (Medicine, Biology, Odontology, Veterinary, etc.). It is true that there is a need for advances in Statistics, and we work on that in our research projects, but it is also essential to share our knowledge and train professionals from other fields in the latest statistical techniques, in order to provide better and more accurate results for their research work and to promote interdisciplinarity, which is crucial for the improvement of any discipline.

2. Could you give us some insight in your current field of research?

My PhD work focuses on smoothed quantile regression and its applications in Biomedicine.

For the general public, one of the best-known applications of the technique in this field is the study of growth curves. In general, every child's growth is followed up by a paediatrician from birth. At these check-ups, measurements of weight and height are taken at each age, which make it possible to monitor the growth of the infant population. The need for smoothing in this case, as well as the differences between boys and girls, is evident.

My latest research in this area was presented at the JEDE II conference last July, and focuses on hypothesis testing in quantile regression. We basically ask whether boys' and girls' growth distributions and percentiles are actually different. If the distributions were not different, we would not need to calculate separate percentiles for each sex; and if they were, it would not necessarily mean that every percentile has to be different. To answer these two questions, we applied bootstrap hypothesis testing, which allows us to assess whether there are statistically significant differences by sex, both between distributions and between each of the percentiles.
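To give a flavour of bootstrap hypothesis testing for a single percentile, here is a deliberately simplified sketch. It compares one empirical quantile between two groups by resampling both groups from the pooled data, which mimics the null hypothesis of no difference. It is not the smoothed quantile regression test described in this interview; the function names, the simulated heights and the number of resamples are all assumptions.

```python
import math
import random


def quantile(values, q):
    """Empirical quantile with linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] * (1 - frac) + ordered[hi] * frac


def bootstrap_quantile_test(group_a, group_b, q, n_boot=500, seed=0):
    """Return the observed quantile difference and a bootstrap p-value."""
    rng = random.Random(seed)
    observed = quantile(group_a, q) - quantile(group_b, q)
    pooled = list(group_a) + list(group_b)
    extreme = 0
    for _ in range(n_boot):
        # Under the null both groups share one distribution,
        # so both are resampled from the pooled data.
        boot_a = [rng.choice(pooled) for _ in group_a]
        boot_b = [rng.choice(pooled) for _ in group_b]
        diff = quantile(boot_a, q) - quantile(boot_b, q)
        if abs(diff) >= abs(observed):
            extreme += 1
    return observed, (extreme + 1) / (n_boot + 1)


# Simulated heights (cm) for two hypothetical groups of children.
rng = random.Random(1)
boys = [rng.gauss(110.0, 5.0) for _ in range(60)]
girls = [rng.gauss(109.0, 5.0) for _ in range(60)]
for q in (0.1, 0.5, 0.9):
    print(q, bootstrap_quantile_test(boys, girls, q))
```

Running the test percentile by percentile reflects the point made above: two groups may differ at some percentiles and not at others.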

3. Do you find it difficult to combine research and advice in Biostatistics?

Yes, I think it is particularly difficult, mainly because of the inflexibility of the system and the internal bureaucracy of the centers. For instance, in the medical environment Biostatistics is usually understood as part of Epidemiology, while in the statistical world it is considered a subset of Statistics. I personally believe both notions are incomplete. Biostatistics starts within the frontiers of Statistics but then crosses them when it is complemented with contributions from Epidemiology that have no place within purely mathematical subjects. Furthermore, in modern Biostatistics the use and creation of specific software for the implementation of statistical techniques is indispensable, and this is something outside the aims of Epidemiology. From my point of view, all these facts position biostatisticians within Statistics, but always building bridges with the Bio environment, to which they must listen and which they must try to understand, so as to bring out the value of the appropriate statistical techniques for each particular study.

4. What would be the 3 main characteristics or skills you would use to describe a good biostatistician?

Statistics, Computing, and interdisciplinarity.

5. What do you think of the situation of young biostatisticians in Spain?

I believe it is very complicated and is mainly centered around universities. From my point of view, Biostatistics is nearly absent from Spain's private sector, and its presence in research centers and/or public foundations is uneven. Added to the current state of the Spanish market, this makes the future of young biostatisticians outside the university really tough, contrary to what happens in the rest of Europe and the US.

6. Which do you think are the main qualities of a good mentor?

Accessible, motivational, and innovative.

7. Finally, is there any topic you would like to see covered in the blog?

I find that it has covered a wide range of areas in the very short time that it has been going. Congratulations, you are doing a great job!!

Selected publications:

  • Martínez-Silva I., Lustres-Pérez V., Lorenzo-Arribas A., Roca-Pardiñas J., Cadarso-Suárez C. Flexible quantile regression models: application to the study of the sea urchin, Paracentrotus lividus (Lamarck, 1816). SORT (Under review).
  • Carballo-Quintás M., Martínez-Silva I., Cadarso-Suárez C., Álvarez-Figueiras M., Ares-Pena F.J., López Martín E. A study of neurotoxic biomarkers, c-fos and GFAP after acute exposure to GSM radiation at 900 MHz in the picrotoxin model of rat brains. Neurotoxicology, 32(4), pp. 478-494, August 2011. D.O.I.: http://dx.doi.org/10.1016/j.neuro.2011.04.003.
  • Cubiella Fernández J., Núñez Calvo L., González Vázquez E., García García M.J., Alves Pérez M.T., Martínez Silva I., Fernández Seara J. Risk factors associated with the development of ischemic colitis. World J Gastroenterol, 16(36), pp. 4564-4569, September 2010.