More on handling data frames in R: dplyr package

I am currently taken an edX course, which is becoming one of my favorite online platforms, along with coursera. This time it is “Data Analysis for Genomics”, conducted by Rafael Irizarry, you might know him by his indispensable Simply Statistics blog. The course has just started, but so far the feeling is really good. It has been through this course that I have found about a new package by the great Hadley Wickhamdplyr. It is meant to work exclusively with data frames and provides some improvements over his previous plyr package.

As I spend quite a bit of time working with this kind of data, and having written a post some time ago about how to handle multiple data frames with the plyr package, I find it fair to update on this one. Besides I think it is extremely useful, and I have already incorporated some of its functions to my daily routine.

This package is meant to work with data frames and not with vectors like the base functions. Its functionalities might replace the ddply function in plyr package (one thing to mention is that is not possible to work with lists yet –as far as I know-). Four functions: filter( ) – instead of subset- , arrange( ) –instead of sort -, select( ) –equivalent to using select argument in subset function-, and mutate( ) – instead of transform- are, in my opinion, reason enough to move to this package. You can check here some examples on using these functions. The syntax is clearly improved and the code gets much neater, no doubt about it.

Two other essential functions are group_by( ) that allows to group the data frame by one or several variables and summarise( ) for calculating summary statistics on grouped data.

You can find general information about the package at the rstudio blog or several other blogs talking about its goodness, here or here.

Not to mention its other great advantage, the speed, not a minor issue for those of us who work with extremely large data frames. Several speed tests have been perfomed (here and here), and it seems to clearly outperform the speed of plyr or data.table packages.


I am so glad to have found it… I hope you will be too!


Statistical or Clinical Significance… that is the question!

Most of the times, results coming from a research project – specifically in the health sciences field – use statistical significance to show differences or associations among groups in the variables of interest. Setting up the null hypothesis as no difference between groups and the alternative showing just the opposite –i.e, there is a relationship between the analyzed factors –, and after performing the required statistical method, a p-value is provided. This p-value indicates, under an established threshold of significance (say, Type I or alpha error), the strength of the evidence against the null hypothesis. If the p-value is lower than alpha, results lead to a statistically significant conclusion; otherwise, there is no statistical significance.

According to my personal and other biostatisticians’ experience in the medical area, most of physicians are only interested in the statistical significance of their main objectives. They only want to know whether the p-value is below alpha. But, the p-value, as noted in the previous paragraph, gives limited information: essentially, significance versus no significance and it does not show how important the result of the statistical analysis is. Besides from significance, confidence intervals (CI) and measures of effect sizes (i.e., the magnitude of the change) should be also included in the research findings, as they can provide more information regarding the magnitude of the relationship of the studied variables (e.g., changes after an intervention, differences between groups,…). For instance, CIs facilitate the range of values within the true difference value of the studied parameter lies.

In clinical research is not only important to assess the significance of the differences between the evaluated groups but also it is recommended, if possible, to measure how meaningful the outcome is (for instance, to evaluate the effectiveness and efficacy of an intervention). Statistical significance does not provide information about the effect size or the clinical relevance. Because of that, researchers often misinterpret statistically significance as clinical one. On one hand, a large sample size study may have a statistically significant result but a small effect size. Outcomes with small p-values are often misunderstood as having strong effect sizes. On the other hand, another misinterpretation is present when non statistical significant difference could lead to a large effect size but a small sample may not have enough power to reveal that effect.
Some methods to determine clinical relevance have been developed: Cohen’s effect size, the minimal important difference (MID) and so on. In this post I will show how to calculate Cohen’s effect size (ES) [1], which is the easiest one.

ES provides information regarding the magnitude of the association between variables as well as the size of the difference of the groups. To compute ES, two mean scores (one from each group) and the pooled standard deviation of the groups are needed. The mathematical expression is the following:

                                                         ES = \frac{\overline{X}_{G1}-\overline{X}_{G2}}{SD_{pooled}}

where X_{G1} = mean of the group G1; X_{G2} = mean of the group G2; and SD_{pooled} is the pooled standard deviation which follows the next formula:

                                                         SD_{pooled} = \sqrt{\frac{s^2_{1}n_{1}+s^2_{2}n_{2}}{n_{1}+n_{2}-2}}

being n_{1} = sample size for G1; n_{2} = sample size for G2; s_{1} = the standard deviation of G1; s_{2} = the standard deviation of G2;

But, how can it be interpreted? Firstly, it can be understood as an index of clinical relevance. The larger the effect size, the larger the difference between groups and the larger the clinical relevance of the results. As it is a quantitative value, ES can be described as small, medium and large effect size using the cut-off values of 0.2, 0.5 and 0.80.
Clinical relevance is commonly assessed as a result of an intervention. Nevertheless, it can be also extended to any other non experimental study design types, for instance, for cross-sectional studies.
To sum up, both significances (statistical and clinical) are not mutually exclusive but complementary in reporting results of clinical research. Researchers should abandon the only use of the p-value interpretation. Here you have a starting point for the evaluation of the clinical relevance.

[1] Cohen J. The concepts of power analysis. In: Cohen J. editor: Statistical power analysis for the behavioral sciences. Hillsdale, New Jersey: Academic Press, Inc: 1998. p. 1-17.


Interview with…..

  Arantzazu Arrospide


 Arantzazu Arrospide Elgarresta studied mathematics in the University of the Basque Country (UPV/EHU) and works as a biostatistician in the Research Unit of Integrated Health Organisations in Gipuzkoa. This research unit gives support to four regional hospitals (about 100 beds each one) and all the public Primary Care Health Services in Gipuzkoa.

Email: arantzazu.arrospideelgarresta@osakidetza.net


Irantzu Barrio

fotoIrantzu  Acting teacher at the Department of Applied  Mathematics, Statistics and Operational Research of the    University of the Basque Country (UPV/EHU)

 Email: irantzu.barrio@ehu.es


Both young biostatisticians are currently working on several ongoing research projects. They belong to the Health Services Research on Chronic Patients Network (REDISSEC) – among others  biostatisticians – and tell us what they think about Biostatistics.

1.    Why do you like Biostatistics?

Irantzu Barrio: On one hand I like applying statistics to real problems, data sets and experiments. On the other hand, I like developing methodology which can contribute to get better results and conclusions in each research project. In addition, I feel lucky  to work in multidisciplinary teams. This allows me to learn a lot from other areas and constantly improve on mine own, always looking for ways to provide solutions to other researchers needs.

Arantzazu Arrospide: I think Biostatistics is the link between mathematics and the real world, giving us the opportunity to feel part of advances in scientific research.

2.    Could you give us some insight in your current field of research?

AA: Our main research line is the application of mathematical modeling the evaluation of public health interventions, especially economic evaluations. Although Markov Chain models are the most common methods for this kind of evaluations we work with discrete event simulation models which permit more flexible and complex modeling.

IB: I’m currently working on my PhD thesis. One of the main objectives of this work is to propose and validate a methodology to categorize continuous predictor variables in clinical prediction model framework. Specifically we have worked on logistic regression models and Survival Models.

3.    You have been doing an internship abroad. What was the aim of your stay?

IB: I did an internship in Guimaraes at the University of Minho, Portugal. During my stay, I worked together with Luis Filipe Meira Machado and María Xosé Rodriguez-Alvarez. The aim was to learn more about survival models and extend the methodology developed so far, considering different prediction models.

AA: I did a short stay in the Public Health department of the Erasmus Medical Centre in Rotterdam (Netherlands) last November. The aim of the visit was to discuss the validation of a discrete event simulation model developed to estimate the health effects and costs of the breast cancer screening program in the Basque Country.

4.    What did allow you to do that was has not been possible in Spain?

IB: Oh! It’s amazing when you realize you have all your time to work on your research project, one and a unique priority for more than two months. Of course, all the other to do’s did not disappeared from my calendar, only were postponed until my return to Bilbao. And, in addition to that, it was also a privilege to work together with high experienced biostatisticians and to have the opportunity to learn a lot from them.

AA: The research group I visited, internationally known as the MISCAN group, is the only European member of the Cancer Intervention and Surveillance Modeling Network (CISNET) created by the National Cancer Institute in the United States. Their main objective is to include modeling to improve the understanding of the impact of cancer control interventions on population trends in incidence and mortality. These models then can project future trends and help determine optimal control strategies. Currently, Spanish screening programs evaluation is mainly based on the quality indicators recommended by the European Screening Guidelines which do not include a comparison with an hypothetical or estimated control group.

5.    Which are the most valuable aspects to highlight during your internship? What aspects do you believe that might be improved?

IB: I would say that my internship was simply perfect. When I came back to Bilbao I just thought time had gone really really fast. I’m just looking forward to go back again.

AA: This group works for and in collaboration with their institutions. They are the main responsible of evaluation of ongoing screening programs, prospective evaluation of screening strategies and leaders for new randomized trials in this topic. This is the reference group in the Netherlands for cancer screening interventions and their institutions consider their conclusions when making important decisions.

6.    What do you think of the situation of young biostatisticians in Spain?

AA: When you work in a multidisciplinary research group both methodological and disease specific knowledge are essential and it takes a long time to achieve it. Institutional support is necessary to obtain long term funds that would ensure future benefits in healthcare research based on rigorous and innovative methods.

IB: I think the situations for young biostatisticians and for young people in general is not easy right now. And at least for what I see around me, there is lot of work to do for.

7.    What would be the 3 main characteristics or skills you would use to describe a good biostatistician? And the main qualities for a good mentor?

AA: Open minded, perfectionist and enthusiastic. As for the mentor, he/she  should be strict, committed and patient.

IB: In my opinion good skills on statistics, probability and mathematics are needed. But at the same time I think it is important to be able to communicate with other researchers such as clinicians, biologists, etc, specially to understand which are their research objectives and be able to translate bio-problems to stat-problems.

For me it is very important to have good feeling and confidence with your mentor. I think that having that, everything else is much easier. On the other hand, if I had to highlight some qualities, I would say that a good mentor would: 1) Contribute with suggestions and ideas 2) Supervise the work done and 3) be a good motivator.

8.    Finally, is there any topic you would like to see covered in the blog?

IB: I think the blog is fantastic, there is nothing I missed in it. I would like to congratulate all the organizing team, you are doing such a good job!!! Congratulations!!!

AA: Although it is not considered part of statistical science operational research methods also can be of interest in our researches.

Selected publications (6):

Arrospide, A., C. Forne, M. Rue, N. Tora, J. Mar, and M. Bare. “An Assessment of Existing Models for Individualized Breast Cancer Risk Estimation in a Screening Program in Spain.”. BMC Cancer 13 (2013).

Barrio, I., Arostegui, I., & Quintana, J. M. (2013). Use of generalised additive models to categorise continuous variables in clinical prediction. BMC medical research methodology13(1), 83.

Vidal, S., González, N., Barrio, I., Rivas-Ruiz, F., Baré, M., Blasco, J. A., … & Investigación en Resultados y Servicios Sanitarios (IRYSS) COPD Group. (2013). Predictors of hospital admission in exacerbations of chronic obstructive pulmonary disease. The International Journal of Tuberculosis and Lung Disease17(12), 1632-1637.

Quintana, J. M., Esteban, C., Barrio, I., Garcia-Gutierrez, S., Gonzalez, N., Arostegui, I., Vidal, S. (2011). The IRYSS-COPD appropriateness study: objectives, methodology, and description of the prospective cohort. BMC health services research11(1), 322.

Mar, J., A. Arrospide, and M. Comas. “Budget Impact Analysis of Thrombolysis for Stroke in Spain: A Discrete Event Simulation Model.”. Value Health 13, no. 1 (2010): 69-76.

Rue, M., M. Carles, E. Vilaprinyo, R. Pla, M. Martinez-Alonso, C. Forne, A. Roso, and A. Arrospide. “How to Optimize Population Screening Programs for Breast Cancer Using Mathematical Models.”.


Mixed Models in Sports and Health

As in many other research areas, mixed models have become widely applied in sports science and related health issues. Within sports science, different branches account for the interests of different stakeholders; general managers, coaches, teams, supporters, scientists, academics. Human performance and medicine (treatment and prevention of injuries) lie behind them all and these models are a way to account for within subject variability, time-dependent covariates, and nested data.

On the competition side, efforts have been made in the literature to find ways to model player performance. Casals and Martínez (2013) approach this problem in the context of the NBA basketball league results by considering a balanced design where player is the random intercept and the forward stepwise approach to model selection by Pinheiro and Bates (2000) has been adopted to determine additional factors to be included in the final model. The great advantage of their proposed models for points and win scores over previous studies is that they account for the variation in players´ characteristics and therefore can predict potential variation in player performance using Linear Mixed Models (LMM) and Generalized Linear Mixed models (GLMM), and are consequently of great help for coaches, managers and other decision makers.

A similar methodology has been followed to predict scoring ability in the English football Premier League by McHale and Szczepański (2014). You may recall the post by Hèctor Perpiñán on calculating results probabilities via simulation. While in Perpiñán´s study only teams´ results were considered for the calculations, McHale and Szczepański´s mixed modelling approach allows for the inclusion of players´ ability to create and convert shots as a disaggregated factor from chance. The accuracy in the prediction of their models in this paper shows again the relevance of models that allow the inclusion of players´ characteristics (rather than just teams´).

Of particular note to our readers is the trend towards the implementation of mixed models in open source software exemplified in the aforementioned papers which use R (R Core Team, 2012) for their modelling, in particular packages nlme and lme4

In community exercises for promoting physical activity like the one described in Okely et al. (2011), one research aim has been to determine policies and interventions that help to encourage exercising during adolescence in order to alleviate risk factors in the long run. This particular school-based randomised controlled trial used mixed models to account for the hierarchical structure of the data (32 schools from four geographical regions). According to the authors, one of the greatest advantages of the methodology is that it “incorporates all available data allowing for the analysis of partial datasets created when a participant drops out of the study or misses a study visit.”

Moving on to health applications, injury prediction, for instance in baseball pitchers and athletes, can be modelled by deploying similar approaches to determine the existence of statistically significant differences between groups and within days post-injury for example.

Finally, in veterinary epidemiology mixed modelling has become equally mainstream, as discussed in a recent article by Stryhn and Christensen (2014). As an example of an application, animal risk factors associated with health disorders in sport can also be modelled via these techniques. Visser et al. (2013) have studied the effect in race horses in the Netherlands via a cross-sectional study and have considered a random farm effect. Commercial software has been used in these two previous examples. – SAS (PROC MIXED) and GenStat (GLMM)- which are again of common use in the application of mixed models.

As stated by Casals and Martínez (2013), phenomena like Moneyball have raised the profile of sports data analysis. For researchers, big data and more widely, the intrinsic complex structure of the data coming from the aforementioned fields –and I would add software availability- have stimulated application of these types of models and they seem to be here to stay…

These are some of the examples that we have found but we will sure be missing some other very interesting ones so please let us know…Are you a researcher in the area and would like to tell us about your experience? Have you used this sort of model in this or other areas and are you willing to share your experience? 

Main references:

Casals, M., & Martinez, J.A. (2013). Modelling player performance in basketball through mixed models Journal of Performance Analysis in Sport, 13 (1), 64-82

McHale, I., & Szczepański, L. (2014). A mixed effects model for identifying goal scoring ability of footballers Journal of the Royal Statistical Society: Series A (Statistics in Society), 177 (2), 397-417 DOI: 10.1111/rssa.12015

Okely, A., Cotton, W., Lubans, D., Morgan, P., Puglisi, L., Miller, J., Wright, J., Batterham, M., Peralta, L., & Perry, J. (2011). A school-based intervention to promote physical activity among adolescent girls: Rationale, design, and baseline data from the Girls in Sport group randomised controlled trial BMC Public Health, 11 (1) DOI: 10.1186/1471-2458-11-658

Visser, E., Neijenhuis, F., de Graaf-Roelfsema, E., Wesselink, H., de Boer, J., van Wijhe-Kiezebrink, M., Engel, B., & van Reenen, C. (2013). Risk factors associated with health disorders in sport and leisure horses in the Netherlands Journal of Animal Science, 92 (2), 844-855 DOI: 10.2527/jas.2013-6692



Interview with…Manuel G. Bedia


Manuel G. Bedia is an assistant professor in the Department of Computer Science at the University of Zaragoza. He is one of the founders of the Spanish Network of Cognitive Science (retecog.net). This network has been established to promote and coordinate research in Cognitive Systems with goals overlapping those of the European Network EUCognition but with more emphasis on the relationships between scientific and educational policies, and the Spanish university system. He holds a BSc in Physics, a MSc.in Technological Innovation management and a Ph.D. in Computer Science and Artificial Intelligence (Best PhD Thesis Award, 2004), all from the University of Salamanca (Spain). He has worked as a Technological Consultant in Innovation and knowledge management (Foundation COTEC, Madrid, Spain) and as a research fellow in the field of artificial cognitive systems in the Department of Computer Science at the University of Salamanca, the Planning and Learning Group at the University Carlos III of Madrid (Visiting Professor, 2005-07) and the Multidisciplinary Institute at the University Complutense of Madrid (2004-07). He has also been a visiting postdoctoral researcher at the Institute of Perception, Action and Behavior (University of Edinburgh, 2005) and the Centre for Computational Neuroscience and Robotics at the University of Sussex, 2008.

1.     Your area of research is Cognitive Sciences. Could you give us a brief introduction about the focus of your work? 

Cognitive science is a space for interdisciplinary research where we aim to understand how the mind works. It joins together neuroscientists, psychologists, philosophers, engineers and of course statisticians too!

During the past five decades, analogies between the human mind/brain and computer software/hardware have led the work of researchers trying to understand how we think, reason and solve problems.

However, over the last few years, new conceptions have arisen doubting this conceptualisation. The biggest influence behind this change in perspective has come from engineers rather than scientists; in particular a group of engineers using the disciplinary tools of engineering to generate new scientific hypotheses instead of applying knowledge generated from other areas.

In a reversal of the usual role of engineers using models for the development of artifacts, the process develops tools to think about mind phenomena.

2. Could you give us an example of this?

Imagine we purposefully build a very simple artifact or software program that is capable of performing a certain task in a novel way. This proves the existence of explanatory alternatives to phenomena that were supposed to work in a certain way. In the words of other authors, the models serve as “mental gymnastics”. They are entities equivalent to classical mental experiments: They are artifacts that help our thinking. These tools are the foundations of modelling exercises: dynamic systems, probability theory, etc.

3. Is probability an important tool in your work?

It is indeed very important and relevant at many levels of the research in this area.

At a fundamental level, the mathematical languages that the early Artificial Intelligence (AI) researchers developed were not sufficiently flexible (they were based on the use of logic and rule systems) to capture an important characteristic of our intelligence: its flexibility to interactively reorganise itself. This led to a growing interest in tools that would embrace this uncertainty.

Recently a very interesting approach has been developed in the area where fundamental principles are based on probability: Artificial General Intelligence (AGI). The original goal of the AI field was the construction of “thinking machines” – that is, computer systems with human-like general intelligence. Due to the difficulty of this task, for the last few decades, the majority of AI researchers have focused on what has been called “narrow AI” – the production of AI systems displaying intelligence regarding specific, highly constrained tasks. In recent years, however, more and more researchers have reapplied themselves to the original goals of the field recognising the necessity and emergent feasibility of treating intelligence holistically. AGI research differs from the ordinary AI research by stressing the versatility and entirety of intelligence. Essentially, its main objective was to develop a theory of Artificial Intelligence based on Algorithmic Probability (further explanations can be found here).

At a more concrete level, there are several examples. For instance, it is well known that the reasoning model of the clinical environment is fundamentally Bayesian. The clinicians analyse and reflect on previous conditions and status of patients, before reaching a diagnosis of their current condition. This fits very well with the whole idea of Bayesian probability. Following the same line of reasoning, probability appears as a fundamental tool to model artificial minds thinking as humans.

In general, this Bayesian framework is the most used in our field.

4. How can this be applied in your area of research?

The Bayesian framework for probabilistic inference provides a general approach to understanding how problems of induction can be solved in principle, and perhaps how they might be solved in the human mind. Bayesian models have addressed animal learning , human inductive learning and generalisation, visual perception, motor control, semantic memory , language processing and acquisition , social cognition, etc.

However, I believe that the most important use comes from the area of neuroscience.

5. So what is the neuroscientific viewpoint in the field of the understanding of our mental functions, the Cognitive Sciences?

Neuroscience intends to understand the brain from the neural correlates that are activated when an individual performs an action. The advances in this area over the years are impressive but this conceptual point of view is not without problems. For instance, as Alva Noë states in his famous book Out of Our Heads, the laboratory conditions under which the measurements are taken substantially affect the observed task…This is a sort of second order cybernetics effect as defined by Margaret Mead decades ago. The history of neuroscience also includes some errors in the statistical analysis and inference phases…

6. Could you explain this further?

In the early 90s, David Poeppel, when researching the neurophysiological foundations of speech perception, found out that none of the six best studies of the topic matched his methodological apparatus (read more here).

Apparently, these issues were solved when functional magnetic resonance imaging (fMRI) emerged. As this technique was affordable it allowed more groups to work on the topic and indirectly forced the analytical methods to become more standardised across the different labs.

However, these images brought in a new problem. In an article in Duped magazine Margaret Talbot described how the single inclusion of fMRI images in papers had arguably increased the probability of these being accepted.

7.  You have also mentioned that big mistakes have been identified in the statistical analysis of data in the area. What is the most common error in your opinion?

In 2011 an eye-opening paper was published on this topic (find it here). The authors focused their research on the misreported significance of differences of significance.

Let’s assume one effect is statistically significantly different from controls (i.e. p<0.05), while another is not (p>0.05). On the surface, this sounds reasonable, but it is flawed because it doesn’t say anything about how different the two effects are from one another. To do this, researchers need to separately test for a significant interaction between the two results in question. Nieuwenhuis and his co-workers summed up the solution concisely: ‘…researchers need to report the statistical significance of their difference rather than the difference between their significance levels.’

The authors had the impression that this type of error was widespread in the neuroscience community. To test this idea, they went hunting for ‘difference of significance’ errors in a set of very prestigious neuroscience articles.

The authors analysed 513 papers in cognitive neurosciences in the five journals of highest impact (Science, Nature, Nature Neuroscience, Neuron and The Journal of Neuroscience). Out of the 157 papers that could have made the mistake, 78 use the right approach whereas 79 did not.

After finding this, they suspected that the problem could be more generalised and went to analyse further papers. Out of these newly sampled 120 articles on cellular and molecular neuroscience published in Nature Neuroscience between 2009 and 2010, not a single publication used correct procedures to compare effect sizes. At least 25 papers erroneously compared significance levels either implicitly or explicitly.

8. What was the origin of this mistake?

The authors suggest that it could be due to the fact that people are generally tempted to attribute too much meaning to the difference between significant and not significant. For this reason, the use of confidence intervals may help prevent researchers from making this statistical error. Whatever the reasons behind the mistake, its ubiquity and potential effect suggest that researchers and reviewers should be more aware that the difference between significant and not significant events is not itself necessarily significant.

I see this as a great opportunity and a challenge for the statistical community, i.e., to contribute to the generation of invaluable knowledge in the applied areas that make use of their techniques.

Selected publications:

Bedia, M. & Di Paolo (2012). Unreliable gut feelings can lead to correct decisions: The somatic marker hypothesis innon-linear decision chains. FRONTIERS IN PSYCHOLOGY. 3 – 384, pp. 1 – 19 pp. 2012. ISSN 1664-1078

Aguilera, M., Bedia, M., Santos, B. and Barandiaran, X. (2013). The situated HKB model: How sensorimotor spatialcoupling can alter oscillatory brain dynamics. FRONTIERS IN COMPUTATIONAL NEUROSCIENCE. 2013. ISSN 1662-5188

De Miguel, G and Bedia, M.G. (2012). The Turing Test by Computing Interaction Coupling. HOW THE WORLD COMPUTES: TURING CENTENARY CONFERENCE AND 8TH CONFERENCE ON COMPUTABILITY IN EUROPE, CIE 2012. Cambridge, ISBN 3642308694

Santos, B., Barandiaran, X., Husband, P., Aguilera, M. and Bedia, M. (2012). Sensorimotor coordination and metastability in a situated HKB model. CONNECTION SCIENCE. 24 – 4, pp. 143 – 161. 2012. ISSN 0954-0091


Spatial objects in R (I)

In this post I would like to offer a brief and practical introduction to different types of spatial data that we could handle in R and how to do to access and visualize them. It might be interesting for those who are beginners in spatial analysis.

The structure of spatial objects is complex, and it is very important to know how these are stored and organized. sp package provides a clear form to organize the spatial data through classes (to specify the structure ) and methods (particular functions for a special data class).

The basic type of spatial objects is Spatial class, in which we could recognize two principal components, called slots: the first one is a bounding box that contains a matrix with the coordinate values. The second is a CRS class object, the reference system for these coordinates (“longlat” for geographical coordinates, “utm” for UTM coordinates, ect). With the command getClass(“Spatial”) (once sp package has been installed), we can examine the different types of spatial objects and their subclasses.

I will use an example to make this something practical, specifically we use a Spanish cartography (“ESP_adm.zip”, downloaded in this web site: http://www.gadm.org/). To use the ‘readShapeSpatial’ function to read the .shp object, we need the library ‘maptools’.

# load the shapefile:
map <- readShapeSpatial('ESP_adm2.shp')
# map is a SpatialPolygonsDataFrame, a subclass of a SpatialPolygons object.

 A Spatial Polygons object is a closed line, a sequence of point coordinates where the first point is the same as the last point. It is not easy to know their structure, let us look at this briefly:

# examine the structure of this SpatialPolygonsDataFrame object.
# Each slot (5) is determined by ‘@name_slot’
str(map, max.level = 2)

But, what means each slot?

  • @data: a data frame which contains information about each polygon. (For instance, the name of provinces, autonomous community,…). To access:
# The provinces:
map@data$NAME_2 # or slot(map, ‘data’)$NAME_2
# Idem for autonomous community with ‘$NAME_1’
  • @polygons: a list of 51 polygons (all provinces), of class Polygon. Each of those polygons have 5 slots, we can see the structure of the first polygon with:
  • @plotOrder: a vector of the plotting order of the 51 polygons.
  • @bbox: two-dimensional matrix with the maximum and minimum of the coordinates.
  • @proj4string: the coordinate reference system (CRS class).

Finally, I will give just a simple example of how to plot some geographical coordinates in the ‘map’. For this, we read a data file with the locations and associate a spatial structure creating a SpatialPoints class.  We need to be careful with the reference system of the SpatialPolygons and the SpatialPoints we used. The coordinate reference system for our shapefile is latitude/longitude and the WGS84 datum. Thus, we need the same CRS to our SpatialPoints. But…what if we do not have geographic coordinates? Let’s see it!

Assume that the coordinate reference system of the SpatialPoints is UTM (Universal Transverse Mercator) instead of latitude/longitude.

loc_utm <- read.csv(“loc_utm.csv”,header=T)

> loc_utm
X       Y
1  718970.0 4376275.0
2 505574.2 4793683.2
3  276066.1 4542666.9
4  538135.4 4750010.9

There are four locations situated (in order) in Valencia, Bilbao, Salamanca and Santiago de Compostela. The first three points are in UTM zone 30 and the last one in zone 29. It is important to know this to define the CRS correctly:

# We separate these points according to the zone:
loc1 <- loc_utm[1:3,]
loc2 <- loc_utm[4,]

# To create a SpatialPoints:
coordinates(loc1) <- c("X","Y")
coordinates(loc2) <- c("X","Y")

# To define the CRS:
proj4string(loc1) <- CRS("+proj=utm +zone=30 +ellps=WGS84")
proj4string(loc2) <- CRS("+proj=utm +zone=29 +ellps=WGS84")

# spTransform function provide transformation between projections.
loc1_geo<- spTransform(loc1, CRS("+proj=longlat +ellps=WGS84"))
loc2_geo<- spTransform(loc2, CRS("+proj=longlat +ellps=WGS84"))

Now we can plot these points over the Spanish mapping.


This post has been a short summary of some of Spatial objects that we can manipulate in R. In the following posts I will invite you to continue learning about this area of statistics. I hope it helps someone!


Working with Joint Models in R

Following previous posts on longitudinal models (see posts) and interviews where mentions have been made to survival analysis (read one of these interviews), we will focus today on models incorporating both methodologies, the so-called Joint Models.

Often problems pose more than a single approach. For example, when HIV data are analyzed, we can on the one hand focus on studying the evolution of CD4 cell counts (longitudinal model) and on the other hand, we can also study the time to death (survival analysis model). Joint modeling will allow both models to share information and will achieve better results.

From a practical point of view, in R, we can use two packages called JM and JMbayes to fit these models, under either the frequentist or the Bayesian paradigm, respectively. The use of these packages is not difficult because they are based on the use of packages nlme and survival with which you might already be familiar.

Here I show you how to use them:

    • We loaded packages JM and JMbayes
    • We fit the longitudinal and survival models independently, saving the results in objects
# Fit the longitudinal model
lmeFit.aids <- lme(CD4 ~ obstime * drug, random = ~obstime | patient, data = aids)
# Fit the survival model
survFit.aids <- coxph(Surv(Time, death) ~ drug, data = aids.id, x = TRUE)
    • We use the objects lmeObject and survObject with the functions in the packages JM and JMbayes
# Fit the joint model (Frequentist)
jointFit.aidsFREQ <- jointModel(lmeFit.aids, survFit.aids, timeVar = "obstime")
# Fit the joint model (Bayesian)
jointFit.aidsBAYES <- jointModelBayes(lmeFit.aids, survFit.aids, timeVar = "obstime")

It’s possible to specify different characteristics about the joint model. With JM package various options for the survival model are available. With JMbayes it is possible to choose the type of association structure between the longitudinal and survival processes.

Finally, if you are interested in these models and are thinking about using them in your research, a good reference is the book Joint Models for Longitudinal and Time-to-Event Data: With Applications in R of Dimitris Rizopoulos, author of JM and JMbayes.

Note: The databank to fit the longitudinal model must have as many rows as observations per patient have (aids). On the other hand, the survival model requires a database that has only one observation for each patient over time to death and censoring indicator (aids.id).


Have an animated 2014!

Animations can be a refreshing way to make a website more attractive, as we did exactly a year ago here.

Based on the Brownian motion simulated here, and using animation and ggplot2 R packages, we produced a fun welcome to the International year of Statistics, Statistics2013. The function saveMovie (please notice that there exists another alternative, saveGIF) allowed us to save it and finally publish it, as easy as that!

For this new year we have based our work on the demo(‘fireworks’) from the 2.0-0 version of the package animation (find  here a list of the changes in each version) and world from the package maps (speaking of maps, have a look at this great dynamic map by Bob Rudis!).

They also come very handy when trying to represent visually a dynamic process, the evolution of a time-series,etc.

As an example, in a previous post we  portrayed an optimisation model in which the values for the mean were asymptotically approaching the optimal solution, which was achieved after a few iterations. This was done with a for loop and again packages ggplot2 and animation.

There are many other examples of potential applications both in general Statistics – see this animated representation of the t distribution and these synchronised Markov chains and posterior distributions plots- and  in Biostatistics. Some examples of the latter are this genetic drift simulation by Bogumił Kamiński and this animated plots facility in Bio7, the integrated  environment for ecological modelling.

R package caTools also allows you to read and write images in gif format.

Finally, in LaTeX, \animategraphics from the animate package will do the trick; check this post by Rob J. Hyndman for further details.

It is certainly one of our new year’s resolutions to incorporate more animations in our posts, what are yours?


Interview with… Anabel Forte


Anabel Forte Deltell, graduated in mathematics and statistics, holds a PhD in Statistics from the Univeristy of Valencia. Now she is a lecturer at the economics department of Universitat Jaume I at Castellón. Her research topics cover many fields of statistics as, for instance, spatial analysis or joint modeling of longitudinal (or panel) data and time-to-event data with her main interest beeing Bayesian model selection and, in particular, Objective Bayesian variable selection. Her dissertation (2011) entitled “Objective Bayes Criteria for Variable Selection” focuses on this last topic. She has published various articles in international journals and participated in many national and international meetings.

Email: forte@uji.es

 1.   Why do you like Biostatistics?

Researching in biostatistics is researching in real life and it makes me feel like I’m helping to improve people´s life conditions.

 2. Could you give us some insight in your current field of research?

Despite my main line of research being Bayesian model selection, I’m actually working in joint modeling of longitudinal data and time-to-event data. Nowadays many practitioners are moving to what it is called personalized medicine with medical decisions, practices, and/or products being tailored to the individual patient. In this scenario longitudinal data seems to be of great importance since it allows for the consideration of several sources of uncertainty including patient-specific variability. Moreover, quantifying the risk of suffering a certain event of a patient given its trajectory seems really sensible and can be done using joint modeling.

 3. Which are, in your opinion, the main advantages of being a researcher?

For me, being a researcher is simply a way of life. I can not imaging myself doing anything else.

 4. Your whole professional experience has been within the public sector and the University. How do you see the present and future of research in the Spanish public sector?

From my point of view if things do not change greatly, researching in Spain, at least in the public sector, has a really black future. As I see it, the only way researchers can manage to get money is to call attention of private investors. But private money can compromise the objectivity of research and at the same time avoid researching in some fields that are not so “attractive” for industry. Hence it is something that has to be studied carefully… but something has to be done, that´s for sure.

 5. What do you think of the situation of young biostatisticians in Spain?

As I see it, young biostatisticians have to make a great effort to show their enormous value for private companies as, for instance, food processing companies or health care related companies among others… And I think so because, pitifully, nowadays the Spanish public system can not assume all the biostatistical research that is needed.

 6. What would be the 3 main characteristics or skills you would use to describe a good biostatistician?

For me it is really important to have a good mathematical/probabilistic base and to be well organized when working. But the most important of all is enjoying research.

 7. Which do you think are the main qualities of a good mentor?

For me, a good mentor is someone that supervises your work and is always there for you but at the same time he or she should give you some space for you to fail and learn… in other words someone that teaches you what researching really is.

Selected publications:

  • Francisco Pozo-Rodríguez; Jose Luis López-Campos; Carlos J. Álvarez-Martínez; Ady Castro-Acosta; Ramón Agüero; Javier Hueto; Jesús Hernández-Hernández; Manuel Barrón; Victor Abraira; Anabel Forte; Juan Miguel Sanchez Nieto; Encarnación Lopez-Gabaldón; Borja G. Cosío; Alvar Agusti. Clinical Audit of COPD patients requiring hospital admissions in Spain: AUDIPOC Study.Plos One. 7 – 7, (USA): 2012.
  • M. J. Bayarri; J. O. Berger; A. Forte; G. García-Donato. Criteria fo Bayesian model choice with application to variable selection. Annals of Statistics. 40 – 3, pp. 1550 – 1577. (USA): 2012.
  • C. Armero; A. Forte; A. López-Quílez. Geographical variation in pharmacological prescription. Mathematical and Computer Modelling. 50, pp. 921 – 928. (UK): 2009.
  • Allepuz; A. López-Quílez; A. Forte; G. Fernández; J. Casal. Spatial Analysis of Bovine Spongiform Encephalopathy in Galicia, Spain (2000-05). Preventive Veterinary Medicine. 79 – 2, pp. 174 – 185. (Holland): 2007.
  • López-Quílez; C. Armero; A. Forte. Geographical variation of Pharmacological prescription with Bayesian Hierarchical models.Value In Health. 10 – 6, (USA): 2007.

Hasta la vista, Statistics2013!

After twelve months filled with exciting events celebrating the great role of statistics in society, it is now time to say goodbye to the International Year of Statistics, Statistics2013.


Illustration: Cinta Arribas

From FreshBiostats we would like to thank the steering committee for making of this such a fantastic opportunity for professionals and students at all levels to interact and share the pride in what we do. We would like to send a special thank you to Ronald Wasserstein and Jeffrey Myers who helped us feel involved in the celebration with our small contributions to their newsletter and blog.

The good news is that it seems like there will be a new project starting in late January. We are really looking forward to The World of Statistics! – find here a whole lot of new activities already arranged for 2014-.


How to make an R package repository: an R-recipe

It is well known that R is one of the most used statistical package by the data scientists (researchers, technicians, etc..). This package allows flexibility to adapt already developed functions to their own interests. However, the problem comes when one needs to perform a complex algorithm that requires strong programming level or even to show to worldwide scientists how a new statistical technique works. When this scenario is present, it means that it is time to create your own package in R.

The most In this post we will learn how to make a simple R package only following several steps which are explained in the next paragraphs:

1. Update R version and other relevant tools.

First of all, if you are planning to develop a R package, it is highly recommended to update your R version to newest one. One must take into account that R versions are continuously updating so that your package would be likely to not to work properly. Furthermore, the package called Rtools is necessary to install in your system. This package has several add-ins (such as MinGW compiler) which are helpful for the development of the package.

2. Get to know your Operative System

For Windows System, you have to add to the PATH system variable the whole path where R is installed. For instance, if R is installed in c:, the PATH can be fixed as:



3. Decide the name and the kind of the package wanted to build.

This step is the most important. First of all, the package name can only be composed by numbers and letters and it is highly recommended to start it with a letter.

When building the functions belonging to the package, it is very important the arrangement of the files where the functions are written. There are two possible situations: (1) all functions in a file; (2) each function in its own file.  Both options are extreme since (1) writing all functions in a file could overload the system and (2) writing many functions in its own files implies having multiple files that should be packed. So, in order to search an alternative to these options, I suggest grouping related functions into the same files.

4. Edit some files

Once the previous steps are done, several files should be filled out/edited so that any R user who wants to use the functions within the package is able to understand what the function does.

First of all, it is obligatory to fill in  a file named DESCRIPTION. It is used to describe the package, with author, and license conditions in a structured text format that is readable by computers and by people.

Secondly, a tutorial of the package (called as VIGNETTES) should be written. An extended description with reproducible examples  shows the performance of each function within the repository.

5. Building, installing and executing the package

As final step, you have to mix all the components in a file by means of different software commands. Using the package.skeleton() function (see below) one can create the desired mypkg package:


where f, g, h, i¸ are all the functions belonging to the repository.

Once mypkg is created, the next step is to install it in our R system. You can use the common ways to perform it but I would advise using R commands to do that (R CMD INSTALL). Load the package and execute it. It is time to enjoy it!!

This has been a summary of the steps to follow for a R package repository development. Now it is your challenge!!!

Merry Christmas and Happy New Year!


Analysis of PubMed search results using R

Looking for information about meta-analysis in R (subject for an upcoming post as it has become a popular practice to analyze data from different Genome Wide Association studies) I came across  this tutorial from The R User Conference 2013 – I couldn´t make it this time, even when it was held so close, maybe Los Angeles next year…

Back to the topic at hand, that is how I found out about the RISmed package which is meant to retrieve information from PubMed. It looked really interesting because, as you may imagine,this is one of the most used resources in my daily routine.

Its use is quite straightforward. First, you define the query and download data from the database (be careful about your IP being blocked from accessing NCBI in the case of large jobs!) . Then, you might use the information to look for trends on a topic of interest or extracting specific information from abstracts, getting descriptives,…

In order to try it out, I decided to get data regarding what has been published relating to Next Generation Sequencing. For doing so, I adopted the search terms proposed in the paper by Jia et al. Through the following code we can get the PubMed results for these search terms since 1980:

query = "(exome OR whole OR deep OR high-throughput OR (next AND generation) OR (massively AND parallel)) AND sequencing"
ngs_search <- EUtilsSummary(query, type="esearch",db = "pubmed",mindate=1980, maxdate=2013, retmax=30000)
ngs_records <- EUtilsGet(ngs_search)
years <- Year(ngs_records)
ngs_pubs_count <- as.data.frame(table(years))

This code allow us to get published papers on this topic per year. By getting also data about the total number of publications per year, we are able to normalize the data. The complete R code, once the data are downloaded and edited can be found at  FreshBiostats GitHub Gist. In the next graph, we can see the publication trend for Next Generation Sequencing per year:


I was also curious about which ones would be the journals with the highest number of publications on this topic. Using the following code we can get the count of NGS publications per journal:

journal <- MedlineTA(ngs_records)
ngs_journal_count <- as.data.frame(table(journal))
ngs_journal_count_top25 <- ngs_journal_count[order(-ngs_journal_count[,2]),][1:25,]
Again, the complete code that allows us to normalize the data by the total number of publications per journal, as well as the following barplots showing the result, is available at our Gist:



You cand find some other examples using this package at Dave Tangs Bioinformatics blog. Additionally, some alternatives to the use of RISmed package can be found at R Chronicle and R Psychologist blogs.

Other potential applications of this package include creating a co-author network, as is described in Matthew Maenner´s blog.

Search and analyze carefully!


My impressions from the Workshop on the Future of the Statistical Sciences

An exciting workshop, organised as the capstone of the International Year of Statistics Statistics2013, was celebrated on the 11th and 12th of November at the Royal Statistical Society Headquarters in London, under the stimulating title of  “The Future of the Statistical Science”, science to which John Pullinger, president of the Royal Statistical Society, refers to as a quintessentially, multidisciplinary discipline. The workshop was indeed a good example of that.

Online attendance was welcomed to the event giving an additional opportunity to listen to highly reputed professionals in main areas of Statistics. Virtual participants were also allowed to pose their questions to the speakers, an innovation that worked very well as an excellent way to make these sorts of events available to a wider audience that would otherwise be excluded – I am already looking forward to more! -.

In case you missed it at the time, luckily the podcast is still available here.

True to its description, the event covered different areas of application, it showed the tools of the field for tackling a wide range of challenges, it portrayed potentially high impact examples of research, and in summary, was a great attention-grabbing exercise that will hopefully encourage other professionals to the area. A common vision of ubiquity and great relevance came across from the speeches and showed the field as a very attractive world to join. See my simplified summary below in form of a diagram certainly reflective of the great Statistics Views motto “Bringing Statistics Together”.


Fig 1. Some of the ideas discussed in the workshop

Particularly relevant to this blog and my particular interests were presentations on environmental Statistics and statistical Bioinformatics, as well as other health-related talks. Important lessons were taken on board from the rest of the speakers too.

The workshop started with a shared presentation on the hot topic “Statistical Bioinformatics” by Peter Bühlmann (ETH Zürich) and Martin Vingron (Free University of Berlin). In a fascinating talk, Bühlmann argued for the need of uncertainty quantification in an increasingly heterogeneous data world –examples of this are also appearing in other fields, e.g. in the study of autism disorders as Connie Kasari and Susan Murphy mentioned in “SMART Approaches to Combating Autism”- and for models to be the basis of the assignation of uncertainties – topic greatly covered by Andrew Gelman in “Living with uncertainty, yet still learning”-, while acknowledging that in the big data context, “confirmatory statistical inference is (even more) challenging”. Vingron followed by focusing on “Epigenomics as an example”, raising open questions to the audience on how to define causality in Biology where “most processes […] are feedback circular processes”, calling for models that are just complex enough so as to allow for mechanistic explanations, and for good definitions of null hypotheses.

In addition to the interesting points of the talk, I found its title particularly attractive in what it could be directly linked to a vibrant roundtable on “Genomics, Biostatistics and Bioinformatics” in the framework of the 2nd Biostatnet General Meeting, in which, as some of you might remember, the definitions of the terms Biostatistics and Bioinformatics were discussed. I wonder if the term “statistical Bioinformatics” would be indeed the solution to that dilemma? As a matter of fact, Bühlmann himself mentions at the start of his talk other options like Statistical Information Sciences, etc-

Michael Newton and David Schwartz from the University of Wisconsin-Madison also focused on the triumph of sequencing and “Millimeter-long DNA Molecules: Genomic Applications and Statistical Issues”. Breast cancer being one of the mentioned applications, this was followed by an introduction to “Statistics in Cancer Genomics” by Jason Carroll and Rory Stark (Cancer Research UK Cambridge Institute, University of Cambridge), particularly focusing on breast and prostate cancer and the process of targeting protein binding sites.  The latter, as a computational biologist, mentioned challenges on the translation of Statistics for the biologists and vice versa, and identified ”reproducibility as (the) most critical statistical challenge” in the area – also on the topic of reproducibility, it is especially worth watching “Statistics, Big Data and the Reproducibility of Published Research” by Emmanuel Candes (Stanford University) -.

Another area increasingly gaining attention in parallel with technologies improvements is Neuroimaging. Thomas Nichols (University of Warwick) lead the audience in a trip through classical examples of its application (from Phineas Gage´s accident to observational studies of the structural changes in the hippocampi of taxi drivers, and the brain´s reaction to politics) up to current exciting times, with great “Opportunities in Functional Data Analysis” which Jane-Ling Wang (University of California) promoted in her talk.

Also in the area of Health Statistics, different approaches to dietary patterns modelling were shown in “The Analysis of Dietary Patterns and its Application to Dietary Surveillance and Epidemiology” by Raymond J. Carroll (Texas A&M University), and Sue Krebs-Smith (NCI), with challenges in the area being; finding dietary patterns over the life course – e.g. special waves of data would appear in periods like pregnancy and menarchy for women-, and incorporation of technology to the studies as a new form of data collection –e.g., pictures of food connected to databases-.

Again with a focus on the challenges posed by technology, three talks on environmental Statistics outlined an evolution over time of the field. Starting from current times, Marian Scott (University of Glasgow) in “Environment – Present and Future” stated in her talk that “natural variability is more and more apparent” because we are now able to visualise it. This idea going back to the heterogeneity claimed by Bühlmann. Despite the great amounts of data being available, and initially caused by technological improvements but ultimately due to a public demand – “people are wanting what it is happening now”-, the future of the field is still subject to the basic questions: “how and where to sample.” Especially thought-provoking were Scott´s words on “the importance of people in the environmental arena” and the need for effective communication: “we have to communicate: socio-economic, hard science…, it all needs to be there because it all matters, it applies to all the world…”

Amongst other aims of this environmental focus such as living sustainably – both in urban and rural environments-, climate is a major worry/fascination which is being targeted through successful collaborations of specialists in its study and statisticians. “Climate: Past to Future” by Gabi Hegerl (University of Edinburgh) and Douglas Nychka (NCAR) covered “Regional Climate – Past, Present and Future” and showed relevant examples of these successful collaborations.

Susan Waldron (University of Glasgow), under the captivating speech title of “A Birefringent Future”, highlighted the need to address challenges in the communication of “multiple layers of information” too, Statistics appearing as the solution for doing this effectively –e.g. by creating “accessible visualisation of data” and by incorporating additional environmental variation in controversial green energy studies such as whether wind turbines create a microclimate-. Mark Hansen (Columbia University Journalism School) seconded Waldron´s arguments and called for journalists to play an important role in the challenge: “data are changing systems of power in our world and journalists. […] have to be able to think both critically as well as creatively with data”, so as to “provide much needed perspective”. Examples of this role are; terms such as “computational Journalism” already being coined and data visualisation tools being put in place -as a fascinating example, the New York Times´ Cascade tool builds a very attractive representation of the information flow in Social Media-. David Spiegelhalter (Cambridge University) also dealt with the topic in the great talk “Statistics for an Informed Public”, providing relevant examples (further explanation on the red meat example can be found here as well).

To encourage “elicitation of experts opinions (to reach a) rational consensus” and “to spend more time with regulators” came up as other useful recommendations in the presentation “Statistics, Risk and Regulation” by Michel Dacorogna (SCOR) and Paul Embrechts (ETH Zürich).

In a highly connected world inmersed in a data revolution, privacy becomes another major issue. Statistical challenges arising from networked data –e.g. historical interaction information modelling- were addressed by Steve Fienberg (Carnegie Mellon University) and Cynthia Dwork (Microsoft Research), who argued that “statisticians of the future will approach data and privacy (and its loss) in a fundamentally algorithmic fashion”, in doing so explicitly answering the quote by Sweeney: “Computer science got us into this mess, can computer science get us out of it?”. Michael I. Jordan (University of California-Berkeley) in “On the Computation/Statistics Interface and “Big Data”” also referred to “bring(ing)  algorithmic principles more fully into contact with statistical inference“.

I would like to make a final mention to one of the questions posed by the audience during the event. When Paul Embrechts was enquired about the differences between Econometrics and Statistics, a discussion  followed on how crossfertilisation between fields has happened back and forth. As a direct consequence of this contact, fields rediscover issues from other areas. For instance, Credit Risk Analysis models were mentioned as being inferred from Medical Statistics or back in Peter Bühlmann´s talk, links were also found between Behavioural Economics and Genetics. These ideas, from my point of view, bring together the essence of Statistics, i.e. its application and interaction with multiple disciplines as the foundations of its success.

Many other fascinating topics were conveyed in the workshop but unfortunately, I cannot fit here mentions to all the talks. I am sure you will all agree it was a fantastic event.


Setting up your (Linux biostatistical) workstation from scratch


Facundo Muñoz, MSc in Mathematics, PhD in Statistics from University of Valencia. He is currently a postdoc researcher at the french Institute National de la Recherche Agronomique (INRA). His main research field is spatial (Bayesian) statistics applied to environmental and biostatistical problems. He is currently working on statistical methodologies for the analysis of forest genetic resources.

Being a busy biostatistician, I spend plenty of time glued to my computer. As an immediate consequence, every once in a while I need to set up my working station from zero. Either when I change job (I have done twice this year!), or when I want to update my OS version (upgrades rarely go perfect), or when I get a new laptop.

This involves, you know, installing the OS, the main programs I need for work like R and LaTeX, some related software like a good Control Version System, a couple of Integrated Development Environments for coding and writing, and a dozen of other ancillary tools that I use every now and then.

Furthermore, I need to configure everything to run smoothly, set up my preferences, install plugins, and so on.

Last time I did this manually, I spent a week setting everything up, and in the following days I always had something missing. Then I thought I should have got this process scripted.

Last week I set up my working environment in my new job. In a few hours I had everything up and running exactly the way I like. I spent an aditional day updating the script with new software, updated versions, and solving some pending issues.

I thought this script might be useful for others as well, hence this post. It is version-controled in a google code repository, where you can download the main script.

It is not very general, as installation details changes a lot from system to system. I use Linux Mint, but I believe it should go pretty straightforward with any derivative of Ubuntu, or Ubuntu itself (those distros using the APT package management). Other Linux branches (Arch, RedHat, Suse, Mac’s Darwin) users would need to make significant changes to the script, but still the outline might help. If you use Windows, well… don’t.

Of course, you will not be using the same software as I do, nor the same preferences or configurations. But it might serve as a guide to follow line by line, changing things to suit your needs.

In particular, it provides an almost-full installation (without unnecessary language packages) of the very latest LaTeX version (unlike that in the repos), and takes care of installing it correctly. It also sets up the CRAN repository and installs the latest version of R.

The script also installs the right GDAL and Proj4 libraries, which are important in case you work with maps in R or a GIS.

Finally, it installs some downloadable software like Kompozer (for web authoring), the Dropbox application, and more. It scrapes the web in order to fetch the latest and right versions of each program.

I hope it helps someone. And if you have alternative or complementary strategies, please share!


FreshBiostats´ First Anniversary

So here we are. It has been a year since we started this venture. The idea of a blog came up from one of our co-bloggers at the Jede II Conference in the summer of 2012.  At first it sounded like a bit of a challenge, but who said fear?

No doubt about it, the balance has been highly positive. We are all for sharing knowledge and resources that might be valuable for others, and from our humble perspective we sincerely hope it might have been of some use. It has certainly been so for us, both by getting insight into particular subjects when writing the different posts and by diving into new topics covered by our co-bloggers and invited posts. Twitter and facebook have also allowed us to encourage interaction with colleagues and other bloggers, and we can now say our social and professional networks have certainly become bigger and stronger!

We have found it difficult at times to juggle our jobs and PhDs with writing our weekly posts but as we said in several occasions, we are passionate about our work, and firmly believe that, most of the time, the line between work and fun gets blurry.

As we promised in a previous post, here is an infographic summarising this year of Fun & Biostatistics, enjoy!


We have a very international audience with visits coming from 117 countries, and we are delighted to see that not only colleagues from our closest network are reading our entries. Since our participation on the International Statistics Year blog -Statistics2013- and after being mentioned in other blogs such as RBloggers and others, we have gained more visibility and some posts have become very popular (more than 1000 views for some!).

Those posts focusing on R tips clearly take the cake, being the most visited. We guess they might be the most useful ones, as we are also big fans of other very practical blogs. However we like to cover all the aspects of our profession and even sometimes deal with more controversial or philosophical subjects…

We will keep inviting people to share their knowledge and will encourage colleagues to get involved in the blog. Our second year resolution is to make an effort to make of this blog a more interactive tool. We count on you for that!

Remember you can contact us with your comments, suggestions, and enquiries at freshbiostats@gmail.com

Thank you so much for being there!


Interview with…Jorge Arribas


Jorge is a BSc Pharmacy from the UPV-EHU and he is currently a resident trainee in Microbiology and Parasitology at the H.C.U. Lozano Blesa in Zaragoza, working also on his PhD thesis.

Email: Jarribasg(at)salud(dot)aragon(dot)es

 1.     Could you give us some insight in your current field of research?

My PhD thesis focuses on the diagnosis of Hepatitis C Virus (HCV) infection by means of core antigen determination, and I am also collaborating in a line of research focusing on new treatments for H. pylori infection, funded by a FIS grant. The former, intends to perform a comparison with respect to the current technique in use for the diagnosis. The latter, analyses resistance to different Flavodoxin inhibitors.

2.     Where does Biostatistics fit in your daily work?

In both areas of research that I am working on, since they are essentially comparative studies against established techniques, and therefore require techniques to prove the significance – or lack of significance- of the improvements.

3.     Which are the techniques that you or researchers in your area use more often?

Statistical techniques such as sensitivity and specificity analysis, hypothesis testing (ANOVA, t-test). There is also a particular need in the area for techniques dealing with ordinal data.

4.     As a whole, do you find Biostatistics relevant for your profession?

A very important part of the speciality of Microbiology and Parasitology focuses on the research of new diagnostic methods, treatments, prevalence of antibiotic resistance, etc. Therefore, Biostatistics becomes extremely useful when comparing these novel approaches to previous ones.

5.     Finally, is there any topic you would like to see covered in the blog?

It would be great to see published some examples of statistical applications in my area of study.

Selected publications:

  • J. Arribas, R. Benito , J. Gil , M.J. Gude , S. Algarate , R. Cebollada , M. Gonzalez-Dominguez , A. Garrido, F. Peiró , A. Belles , M.C. Rubio (2013). Detección del antígeno del core del VHC en el cribado de pacientes en el programa de hemodiálisis. Proceedings of the XVII Congreso de la Sociedad Española de Enfermedades Infeccionas y Microbiología Clínica (SEIMC).
  • M. González-Domínguez, C. Seral, C. Potel, Y. Sáenz, J. Arribas, L. Constenla, M. Álvarez, C. Torres, F.J. Castillo (2013). Genotypic and phenotypic characterisation of methicillin-resistant Staphylococcus aureus (MRSA) clones with high-level mupirocin resistance in a university hospital. Proceedings of the 23nd European Congress of Clinical Microbiology and Infectious Diseases and 28th International Congress of Chemotherapy.
  • M. González-Domínguez, R. Benito, J. Gil,  MJ. Gude, J. Arribas, R. Cebollada, A. Garrido, MC. Rubio (2012).  Screening of Trypanosoma cruzi infection with a chemiluminiscent microparticle immunoassay in a  Spanish University Hospital. Proceedings of the 22nd European Congress of Clinical Microbiology and Infectious Diseases and 27th International Congress of Chemotherapy.

FreshBiostats birthday and September-born famous statisticians

With the occassion of the 1st birthday of FreshBiostats, we want to remember some of the great statisticians born in September and that have contributed to the “joy of (bio)stats”.

Gerolamo Cardano Pavia, 24 September 1501 – 21 September 1576 First systematic treatment of probability
Caspar Neumann Breslau, 14 September 1648 – 27 January 1715 First mortality rates table
Johann Peter Süssmilch Zehlendorf, 3 September 1707 – 22 March 1767 Demographic data and socio-economic analysis  
Georges Louis Leclerc (Buffon) Montbard, 7 September 1707 – Paris, 16 April 1788 Premier example in “geometric probability” and a body of experimental and theoretical work in demography
Adrien-Marie Legendre Paris, 18 September 1752 – 10 January 1833 Development of the least squares method
William Playfair Liff, 22 September 1759 – London, 11 February 1823 Considered the founder of graphical methods of statistics (line graph, bar chart, pie chart, and circle graph)
William StanleyJevons Liverpool, 1 September 1835 – Hastings,13 August 1882 Statistical atlas – graphical representations of time series
Anders Nicolai Kiaer Drammen, 15 September 1838 – Oslo, 16 April 1919 Representative sample
Charles Edward Spearman London, 10 September 1863 – 17 September 1945 Pioneer of factor analysis and Spearman´s Rank correlation coefficient
Anderson Gray McKendrick Edinburgh, September 8, 1876 – May 30, 1943 Several discoveries in stochastic processes and collaborator in the path-breaking work on the deterministic model for the general epidemic
Maurice Fréchet Maligny, 2 September 1878 – Paris, 4 June 1973 Contributions in econometrics and spatial statistics
Paul Lévy 15 September 1886 – 15 December 1971 Several contributions to probability theory
Frank Wilcoxon County Cork, 2 September 1892 – Tallahassee, 18 November 1965 Wilcoxon rank-sum tests, Wilcoxon signed-rank test
Mikhailo Pylypovych Kravchuk Chovnytsia, 27 September 1892- Magadan, 9 March 1942 Krawtchouk polynomials, a system of polynomials orthonormal with respect to the binomial distribution
Harald Cramér Stockholm, 25 September 1893 – 5 October 1985 Important statistical contributions to the distribution of primes and twin primes
Hilda Geiringer Vienna, 28 September 1893 – California, 22 March 1973 One of the pioneers of disciplines such as molecular genetics, genomics, bioinformatics,…
Harold Hotelling Fulda, 29 September 1895 – Chapel Hill, 26 December 1973 Hotelling´s T-squared distribution and canonical correlation
David van Dantzig Rotterdam, 23 September 1900 -Amsterdam,  22 July 1959 Focus on probability, emphasizing the applicability to hypothesis testing
Maurice Kendall
Kettering, 6 September 1907 – London, 29 March 1983 Random number generation and Kendall´s tau
Pao-Lu Hsu Peking, 1 September 1910 – Peking, 18 December 1970 Founder of the newly formed discipline of statistics and probability in China

It is certainly difficult to think of the field without their contributions. They are all a great inspiration to keep on learning and working!!

Note: you can find other interesting dates here.

Update: and Significance´s timeline of statistics here.

Any author you consider particularly relevant? Any other suggestions?


Infographics in Biostatistics

Although the history of Infographics according to Wikipedia does not seem to mention Fritz Kahn as one of the pioneers of this technique, I would like to start this post mentioning one of the didactic graphical representations of this Jewish German doctor, who was highly reputed as a popular science writer of his time.

Apart from his fascinating views of the human body in the form of machines and industrial processes, I am particularly attracted by his illustration below, summarising the evolution of life in the Earth as a clock in which the history of humans would not take more than a few seconds…

Animal clock

Image extracted from the printed version of the article “Fritz Kahn, un genio olvidado” published in El País, on Sunday 1st of September 2013.

 What could be understood by some as a naive simplification of matters requires, in my opinion, a great deal of scientific knowledge and it is a fantastic effort to communicate the science behind very complex mechanisms.

This and more modern infographic forms of visualisation represent an opportunity for statisticians –and more specifically biostatisticians-, to make our field approachable and understandable for the wider public. Areas such as Public Health (see here), Cancer research (find examples here and here), and Drug development (see here) are already using them, so we should not be ashamed to make of these “less rigorous” graphical representations an important tool in our work.

Note: There are plenty of resources available online to design a nice infographic in R. For a quick peek into how to create easy pictograms, check out this entry in Robert Grant´s stats blog. Also, the wordcloud R package will help you visualising main ideas from texts…

We will soon show a practical example of these representations in this blog, keep tuned!


An example of Principal Components Analysis

The last post that I published was about two techniques of Multivariate Analysis: Principal Component Analysis (PCA) and Correspondence Analysis (CA). In this post I will show a practical example of PCA with R. Let’s go!

We are going to work with Fisher’s Iris Data available in package “datasets”. This data, collected over several years by Edgar Anderson was used to show that these measurements could be used to differentiate between species of irises. That data set  gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris.  The species are Iris setosa, versicolor, and virginica.

You can load the Iris data and examine this data frame with:

str(iris); summary(iris[1:4])
pairs(iris[1:4],main="Iris Data", pch=19, col=as.numeric(iris$Species)+1)
mtext("Type of iris species: red-> setosa; green-> versicolor; blue-> virginica", 1, line=3.7,cex=.8)


We will use “prcomp” R function to carry out the analysis, which is similar to “princomp” function.

As we said in the last post, PCA is used to create linear combinations of the original data that capture as much information in the original data as possible. For that and before starting with PCA is convenient to mention some particularities of this methodology.

In the prcomp function we need indicate if the principal components are calculated through correlation matrix (with standardized data) or covariance matrix (with raw data). We will standardize our variables when these have different units and have very different variances. If they are in the same units both alternatives are possible. In our example all variables are measured in centimetres but we will use the correlation matrix for simplicity’s sake.

#To examine variability of all numeric variables
# maybe this range of variability is big in this context.
#Thus, we will use the correlation matrix
#For this, we must standardize our variables with scale() function:
iris.stand <- as.data.frame(scale(iris[,1:4]))
sapply(iris.stand,sd) #now, standard deviations are 1

Now applied the prcomp() function to calculate the principal components:

#If we use prcomp() function, we indicate 'scale=TRUE' to use correlation matrix
pca <- prcomp(iris.stand,scale=T)
#it is just the same that: prcomp(iris[,1:4],scale=T) and prcomp(iris.stand)
#similar with princomp(): princomp(iris.stand, cor=T)
#This gives us the standard deviation of each component, and the proportion of variance explained by each component.
#The standard deviation is stored in (see 'str(pca)'):

In order to decide how many principal components should be retained, it is common to summarise the results of a principal components analysis by making a scree plot, which we can do in R using the “screeplot()” function:

#plot of variance of each PCA.
#It will be useful to decide how many principal components should be retained.
screeplot(pca, type="lines",col=3)

scree_plotPCAFrom this plot and from the values of the ‘Cumulative Proportion of Variance’ (in summary(pca)) we can conclude that retaining 2 components would give us enough information, as we can see that the first two principal components account for over 95% of the variation in the original data.

#The loadings for the principal components are stored in:
pca$rotation # with princomp(): pca$loadings

This means that the first two principal component is a linear combination of the variables:

PC1 = 0.521*Z_1 - 0.269*Z_2 + 0.580*Z_3 + 0.564*Z_4

PC2 = -0.377*Z_1 - 0.923*Z_2 - 0.024*Z_3 - 0.066*Z_4

where Z_1, \ldots, Z_4 are the standardization of original variables.

The weights of the PC1 are similar except the associate to Sepal.Width variable that is negative. This component discriminate on one side the Sepal.Width and on the other side the rest of variables (see biplot).  This one principal component accounts for over 72% of the variability in the data.

All weights on the second principal component are negative. Thus the PC2 might seem considered as an overall size measurement. When the iris has larger sepal and petal values than average, the PC2 will be smaller than average. This component explain the 23% of the variability.

The following figure show the first two components and the observations on the same diagram, which helps to interpret the factorial axes while looking at observations location.

#biplot of first two principal components
abline(h = 0, v = 0, lty = 2, col = 8)


To interpret better the PCA results (qualitatively) would be useful to have the opinion of an expert in this area, as sometimes is somewhat confusing. I encourage you to participate!


Simulation, a good option to calculate/estimate probabilities

Much of the work we do is based on statistical models used to reproduce real-life processes, yet it is not the only option. When we know the rules and restrictions that dominate a system, we can reproduce their behaviour using simulation techniques and with a few simple statistical calculations, we can get to know everything that we are interested in the system.

Coinciding with the start of the Spanish League 2013-2014 two weekends ago, myself and a group of friends were thinking about a series of questions about the future results of the League:

  1. Will we have a League of two (Barcelona and Madrid)?

  2. What are the chances of my team to win the League (I am a supporter of Valencia)?

  3. Assuming that Madrid or Barcelona will win League, what options have the other teams to obtain the second position?

To answer these questions, I was challenged  to simulate the current League using information from the last 5 seasons:

  • I downloaded the general statistics of the last five seasons from www.as.com website. With this information, I calculated the odds of home win (HW), home draw (HD), home loss (HL), away win (AW), away draw (AD) and away loss ​​(AL) for each of the 20 current teams in Spanish League. (Note: Elche probabilities are the average of probabilities from the 10 teams that have been on the First Division in the last five leagues but are not on that division in the 2013-2014 league.)

  • From the League schedule I have calculated the probability of win, draw or loss for each league matches, being given for each match the names of the local and the visitor teams.

  • I simulated 10000 leagues from which I calculated the probabilities Pr(Win), which is the probability of winning the current league; Pr(Champ), the probability to obtain a position between first and fourth (Champions League); Pr(Eur) is the probability of entering European competitions (Champions or Europa League) ;and Pr(ncc) is the probability of  not changing category.


Pr(Barcelona 1st and Madrid 2nd or Madrid 1st and Barcelona 2nd) = Pr(Barcelona 1st and Madrid 2nd) + Pr(Madrid 1st and Barcelona 2nd) = 0.2930 + 0.2428 = 0.5358


Figure 1: Boxplot of the Points per Team in the 10000 Leagues simulated.

Besides the obvious conclusions we can draw from these results, we can see that we are clearly in a league of two. This sort of procedure will also allow us to emulate complex systems in which we know the rules for calculating the corresponding probabilities. For example, in Biostatistics we could work out the probabilities of an offspring being affected by a genetic disease if we know the probabilities of the parents being carriers (genetic problem).

If you are interested in the topic Simulation and Sports, I recommend reading the paper “On the use of simulation methods to compute probabilities: application to the Spanish first division soccer league” of Ignacio Díaz-Emparanza and Vicente Núñez-Antón, which explains in much more detail how to approach this problem from different points of views.

I have implemented all the calculations for this post with free software R.

Do you have experience in simulation? Tell us about it!!!


Keeping up to date with research in your field (Part II)

  • Online free courses/tutorials: there is plenty of material on line, which makes it sometimes difficult to filter what is really worthy. Here again, tips from blogs or colleagues from your network might serve as reference. Coursera is, in my opinion, one of the best platforms, due to the quality and versatility of its courses. There are several excellent courses related to Statistics and Data Analysis. Some of them are more general about R programming (e.g. Data Analysis,Computing for Data Analysis – both using R- ),but  there are also more specific ones (e.g. Design and Interpretation of Clinical Trials, Statistical Analysis of fMRI Data,.. you can check the full list here.

I would like to mention here some other resources available for those with a math/statistics background who might be interested in getting some insight into genetics. As we mentioned previously in other posts, it is critical to understand the data you are dealing with and these websites will help you with that:

An extensive list of additional Online Genetics Education Resources can be found at the NHGRI site

For those wanting to get an introduction to NGS, there is a Next Generation Sequencing Practical Course at EMB-EBI Train online. A more advanced tutorial, showing the use of R/Bioconductor  packages for High-Throuput Sequence Analysis can be found here.

There are, obviously, countless courses and tutorials about R and specific packages. Besides, GitHub is becoming more and more popular.By creating Gist on GitHub you can share your code quickly and easily, see a quick example here.

  • Webinars:  many commercial sites offer highly focused free Webinars that might be of interest. For instance both Science and Nature host webcasts regularly.

  •  Forums /discussion list: when you are stuck with something and you are not able to find a solution, specialized forums might come to the rescue. Either because your same question has been asked before, or because there is someone willing to help, you will most likely get your doubt solved. Two forums are particularly useful in my field, BioStar and SEQanswers. Talking about R programming, R-help from R Mailing List and Stack Overflow are two of the sites where you can found most of your doubts solved. Our life without them would be much more difficult for sure…

As I mentioned at the beginning of the previous post, it is sometimes difficult to find a balance between the time you spend learning and your more “productive” time. Besides for those of us whose work is also a passion, the line between work and personal interests becomes blurred quite often. And so we will spend much of our leisure time diving around new stuff that eventually will be useful in our work. Some might argue that the time spent in training or the amount of information you have access to might be overwhelming. Is it worth the effort? How much time should we invest in learning? Are we able to take advantage of what we learn? You can take a look at this video  for more elaborate thoughts on the subject.

I hope the information contained in these posts might be useful… Your suggestions on additional resources will be more than welcome!


Keeping up to date with research in your field (Part I)

No doubt about it, we must keep up with news and advances in our area of expertise. In this series of two posts I just want to introduce the ways I find useful in order to achieve this goal. Staying up-to-date means not only knowing what is being done in your field but also learning new skills, tools or tricks that might be applied. I will save for last some thoughts about getting a proper work-learning balance and potential impact on productivity.

  • Blogs. It might be an obvious one, but it is for sure one of my main sources of information. Several blogs I follow include: Getting Genetics Done, Genomes Unzipped, Our 2 SNPs, Wellcome Trust, R-BloggersSimply Statistics and many others mainly focused on biostatistics that you can find in our blog roll. Most of them are accessible through RSS feeds, if not through mail subscription.
  • Twitter. Most blogs have also a twitter account where you can follow their updates (so it might be an alternative). You can follow twitter accounts from networks of interest, companies or people working in your field too. For some ideas on whom to follow, go to our twitter!
  • PubMed / Journals alerting services. A keyword specific PubMed search can be just as relevant. Both available through email and RSS Feeds, you will get updates containing your search terms (for instance “Next Generation Sequencing”, “rare variant association analysis”, “Spastic Paraplegia”…). You can also get information about an author´s work or the citations of a given paper. You can find here how to do it.  An alternative is to set up alerts for Table of Contents of your journals of interests, informing of the topics of latest papers (Nature Genetics, Bioinformatics, Genome Research, Human Mutation, Biostatistics…) Accessing RSS Feeds through your mail app is straightforward -Mozilla Thunderbird in my case-.
  • Professional networking sites. Obviously, when it is all about networking, having a good network of colleagues is one of the best ways to keep up with job offers, news or links to resources. For instance through my LinkedIn contacts I receive quite a bunch of useful tips. Well selected LinkedIn groups are also a source of very valuable information and news, as well as companies in your area or work (pharma industry, genomic services, biostatistics/bioinformatics consulting). This is a more general site, but there are other professional sites focused on Research: ResearchGate and Mendeley. Mendeley in particular, apart from a networking site is an excellent reference manager. This, along with MyNCBI are the two main tools I use to keep my bibliography and searches organized.
  •  Distribution lists.  Apart from general distribution lists including one´s institution or funding agency, more specific newsletters or bulletins from networks as Biostatnet or  scientific societies you belong to, are a good source of news, events and so on, or even more restricted ones (for instance in my institution an R users list has been recently created).

To be continued next week …..


A brief introduction to the SAS® System: what is and how does it work?

In previous posts on this site, comparisons among the most used statistical packages available for scientists were conducted, pointing out their strengths and weaknesses.

Although nowadays there is a trend to use open source programs for data science development, there are some commercial statistical packages which are also important since they make our scientist life easier. One of them is called Statistical Analysis System® (SAS).

SAS System was founded in 1970s and since then it is leading product in data warehousing, business analysis and analytical intelligence. It is actually the best  selling all-in-one database software. In other words, SAS can be described as an Exporting-Transformation-Loading (ETL), Reporting and Forecasting tool. This makes SAS a good option for Data warehousing. It could be also considered as a statistical software package that allows the user to manipulate and analyze data in many different ways.

The main basic component of the SAS program is the Base SAS module which is the part of the software designed for data access, transformation, and reporting. It contains: (1) data management facility – extraction, transformation and loading of data – ; (2) programming language; (3) statistical analysis and (4) output delivery system utilities for reporting usage. All these functions are managed by means of data and call procedures. In the following sections some introductory and basic examples will be described:

(a) Generating new data sets.

It is possible to generate new data sets by using the SAS Editor environment (place where the code is written and executed). Suppose we want to create a data set of 7 observations and 3 variables (two numerical and one categorical).

data one;
input gender $ age weight;
F 13 43
M 16 62
F 19 140
F 20 120
M 15 60
M 18 95
F 22 75

The SAS code shown above will create the desired data set called “one”. As gender variable is categorical – the first variable-,  its values are “M” for MALE and “F” for FEMALE. The dollar sign ($) is used when you have a text variable rather than a numerical variable  (i.e., gender coded as M, F rather than as 1 denoting male and 2 denoting female).

(b) Statistical analysis

For the performance of a basic descriptive analysis, – percentages and mean computations – the next procedures should be executed:

·         For frequencies and percentage computation.

proc freq data = one;
tables gender;

·         Descriptive analysis for continuous variables.

proc means data = one n mean std;
var age weight;

As it can be observed, it is really easy to remember which statistical procedure to use according to the type of variable: for categorical data proc freq procedure  and proc means  for continuous data.

(c) Output

Another important issue is the way the results are shown. The SAS statistical package has improved this section. Before the release of the version 9.0 – I think so!- , one had to copy and paste all the results from the output window to a WORD document in order to get a proper and saved version of the results. From version 9.0, things have changed: All results can be delivered to a PDF, RTF or even HTML format. As SAS user, I can say that this has been a good idea, since not only I no longer have to waste lots of time doing copy-paste, but also the not so useful results can be left unprinted. That has been a great leap!!!  For example, if you want to have the results in PDF format, you should follow the next instructions:

ods pdf;
proc means data = one n mean std;
var age weight;
ods pdf close;

This code generates a PDF file where shows the mean and standard deviation of the age  and  weight  variable of the one data set.

Because of its capabilities, this software package is used in many disciplines, including the medical sciences, biological sciences, and social sciences. Knowing the SAS programming language will likely help you both in your current class or research, and in getting a job. If you want to go further in SAS programming knowledge, The Little SAS Book by Lora Delwiche and Susan Slaughter is an excellent resource for getting started using SAS. I also used it when I started learning SAS. Now it is your turn!!

Have a nice holiday!


Mixed Models in R: lme4, nlme, or both?

The topic of Mixed Models is an old-friend of this blog, but I want to focus today on the R code for these models.

Amongst all the packages that deal with linear mixed models in R (see lmm, ASReml, MCMCglmm, glmmADMB,…), lme4 by  Bates, Maechler and Bolker, and nlme by Pinheiro and Bates are probably the most commonly used -in the frequentist arena-, with their respective main functions lmer and lme.

I am still unsure as to which one I would choose - if I had to -, but I am sharing here a summary of some of their capabilities, in case it can be of help:

Model specification

I will be using for all of the following examples the balanced dataset Orthodont from the package nlme, which reflects the change in an orthodontic measurement over time for a sample of 27 children (see Fig. 1).

Spaghetti plot_Orthodont

Fig. 1. Spaghetti plots. Distance vs age by gender – dataset Orthodont

For simplicity´s sake, I will consider the following initial models with a simple random factor (please see ref. [3] for centering and further analysis):

model_lme<-lme(distance~age+Sex, random=~1|Subject,data=Orthodont)


  • lmer

The results for t-tests and F-tests based on Restricted Maximum Lilkelihood (REML) can be found by using the following lines of code (you can add REML=FALSE to change this default setting):


Fixed effects:

                           Estimate       Std. Error    t value

(Intercept)          17.70671       0.83391     21.233

age                      0.66019        0.06161     10.716

SexFemale           -2.32102        0.76139    -3.048

(Please notice that the reference category in Sex can be changed by using relevel).


Analysis of Variance Table

       Df      Sum Sq     Mean Sq      F value

age  1      235.356    235.356    114.8383

Sex  1        19.034      19.034        9.2875

  • lme

Conditional t-tests and F-tests are used for assessing the significance of terms in the fixed effects in lme.

Both F and t conditional tests results are based on REML conditional estimate of the variance. This option will be the default, but we can specify method=”ML”   for Maximum Likelihood estimates.

For the results from the conditional t-test testing the marginal significance of each fixed effect coefficient with the other fixed effects in the model, we can use:


Fixed effects: distance ~ age + Sex

                                Value      Std.Error    DF         t-value       p-value

(Intercept)       17.706713   0.8339225    80     21.233044      0.0000

age                     0.660185   0.0616059   80     10.716263      0.0000

SexFemale        -2.321023   0.7614168    25     -3.048294      0.0054

For the results from the conditional F-test testing the significance of terms in the fixed effects model (sequentially by default), we can use:


                         numDF      denDF      F-value    p-value

(Intercept)           1                80      4123.156    <.0001

age                       1               80      114.838     <.0001

Sex                       1               25        9.292       0.0054

(The argument type would also allow us to specify marginal F-tests).

These conditional tests for fixed-effects terms require denominator degrees of freedom, which will be the focus of the next section.

Degrees of freedom

  • lme

The denominator degrees of freedom correspond to “the classical decomposition of degrees of freedom in balanced, multilevel ANOVA designs” [3]. It is worth noticing that the values for these degrees of freedom do not match those provided by other software procedures such as SAS PROC MIXED (see discussions on the topic here and here).

Additionally to the denominator degrees of freedom aforementioned, conditional F-tests also require numerator degrees of freedom defined by the term (see output from previous section).

  • A good explanation regarding the reporting of degrees of freedom in lmer is given by the author of the package in this article (page 28).


  • lme reports p-values (see previous output), whereas
  • lmer doesn’t but this has been justified by Bates.

Random effects

  • lme allows for nested random effects in a very straightforward way (random=~1|a/b, where factor b is nested in a). Crossed random effects on the other hand, can be dealt with through “a combination of pdBlocked and pdldent objects” [3].
  • Nested random effects can again be easily modelled in lmer (+(1|a/b)). Crossed random effects are handled in an easier way than in lme (+(1|a)+(1|b)). You can find further explanations in [2].

Random effects plots

Two different approaches to the plotting of the random effects can be obtained through the following lines of code:

  • lme
  • lmer

Random effects_plots

Fig. 2. Random effects plots for model_lme and model_lmer.

Residuals plots

  • lme allows to plot the residuals in the following ways:

Plots_lmeFig. 3. Residual plots for model_lme.

  • We can also plot the first two graphs in lmer, but the last line of code does not seem to work with this function.

Correlation structure

  • We can easily incorporate correlation structures in lme.  Mostly used for temporal correlation structures are corAR1, corCAR1 (autoregressive and continuous autoregressive correlation structures of order 1), and corCompSymm (compound symmetry).
model_lme<-lme(distance ~ age + factor(Sex),random = ~ 1 | Subject, cor=corAR1(0.6,form=~1|Subject),data = Orthodont)

Correlation Structure: AR(1)

Formula: ~1 | Subject

Parameter estimate(s):



Further structures available can be found in the help for corClasses.

  • There is not an argument in lmer for doing this, but guidance on how to incorporate the structures can be found here.


  • lme allows you to model heteroscedasticity using the varFunc object,  but
  •  it is not covered by lmer.

Although some other differences could be mentioned, future posts will cover those other matters.

The following books have been found extremely useful:

[1] Badiella, L., Sánchez, J.A. “Modelos mixtos con R versión 0.3” ( from the course “Modelos Mixtos utilizando R”)

[2] Bates, D. (2010). lme4: Mixed-effects modelling with R. Springer

[3] Pinheiro, J.C. and Bates, D.M. (2000). Mixed-Effects Models in S and S-Plus. Springer

[4] Zuur, A.F. et al. (2009). Mixed effects models and extensions in ecology with R. Springer

I will be more than happy to read your comments and tips! :-)


Side effects of open health data

Recent improvements in the technology to record data, have coincided with calls for making this data freely available. Health related studies are a particular case in point.

At the forefront of these changes, reputable publications have taken measures to set transparency standards.  Since January for example, the British Medical Journal “will no longer publish research papers on drug trials unless all the clinical data on what happened to the patients is made available to anyone who wants to see it.” (Significance magazine Volume 9 Issue 6, December 2012)

In a sector that has often been accused of secrecy, GlaxoSmithKline are also engaged in this spirit of openness. They recently announced that they would make available “anonymised patient-level data from their clinical trials” to researchers “with a reasonable scientific question, a protocol, and a commitment from the researchers to publish their results” (ibid).


Fig 1. Death rates per 1000 in Virginia (USA) in 1940, VADeaths R dataset (for illustration only)

However, in the past few weeks two stories seem to challenge this trend towards greater transparency. At the same time as rumours grow in the UK of cuts in the publication of well-being data (The Guardian, 10th of July 2013), controversy has arisen regarding the recently released National Health System (NHS) vascular surgeons individual performance records (BBC News 28th of June 2013) .

While the measure has been welcomed by some sectors of the general public, there have been important criticisms coming from the medical side. Several  doctors within the speciality, with perfectly satisfactory records, are refusing to agree to the metric. The argument is that different types and number of procedures coupled with the variability of prognoses make published indicators such as death rates misleading to the patients.

In general, calls have been made for further research into performance indicators that ensure information provided to the end-users is efficacious. As an example of this, back in 2011 when the first attempts to publicise this kind of information started, Significance magazine (Volume 8 Issue 3, September 2011) reported as one of the causes for the lack of success, failure to agree on “which indicators to use”, and also mentioned “discussions with the Royal College of General Practitioners to establish the most meaningful set.”

Tensions between opening up areas of genuine interest to the widest audience and ensuring that there are not unintended side effects, are a societal challenge in which statisticians can play a vital role: sometimes numbers cannot speak for themselves, and appropriate interpretations might be required to avoid wrong conclusions. This becomes particularly important when dealing with health issues…

Note: in Spain, it would seem that there is still much work to be done in terms of open data…. A PricewaterhouseCoopers report (pp. 120 -131) highlights the issue as one of the ten hot topics in the Spanish Health System for 2013, and welcomes the creation of the website www.datos.gob.es as one of the first steps towards openness in this and other sectors.

What are your thoughts on this issue? Are there any similar measures being taken in your country or organisation?


Interview with…Laetitia Teixeira

Laetitia_TeixeiraLaetitia is a graduate in Applied Mathematics and she also has a Master’s degree in Applied Statistics and Modelling from the University of Porto, Portugal. At present, she is a PhD student on Applied Mathematics and she works in the Research Unit UNIFAI (Institute of Biomedical Sciences Abel Salazar, University of Porto, Portugal) 

Email: laetitiateixeir@gmail.com

1. Why do you like Biostatistics?

Biostatistics allows for the application of statistical theory to various areas of research practice and to work in various areas. Statistics and medicine are two areas of great interest for me and biostatistics allows me to work in both.

2. Could you give us some insight in your current field of research?

My PhD work focuses on survival analysis in the presence of competing risks. All the practical work is based on end-stage renal disease patients with peritoneal dialysis as renal function replacement therapy. We explore several statistical approaches, such as regression models taking competing risks into account, multistate models and joint models for longitudinal and competing risks data. Using these approaches, we can give more and better information about the disease progression, helping clinicians in the evaluation of patients and treatment planning.

Combined with my PhD, I am a fellow researcher at UNIFAI/ICBAS-UP, a Research Unit specialized in ageing and health.

3. Which are, in your opinion, the main advantages of being a researcher?

The opportunity to work in several areas with multidisciplinary teams.

4. What do you think of the situation of young biostatisticians in Portugal?

In Portugal, biostatisticians are mostly present in higher education institutions. Some public and private enterprises have been integrating some young biostatisticians, however in a very limited number. Some colleagues have gone to other European countries, where they have found better opportunities in this area.

5. What would be the 3 main characteristics or skills you would use to describe a good biostatistician?

Interested in research, versatile and good communicator.

6. Which do you think are the main qualities of a good mentor?

Motivator, interested in research and dedicated, good communicator.

7. Finally, is there any topic you would like to see covered in the blog?

A list of working groups organized by research themes. This list would be important for young biostatisticians to find people according to working area and would allow students/researchers to create other networks.

Selected publications:

  • Teixeira, L., Rodrigues, A., Carvalho, M.J., Cabrita, A. & Mendonça, D. (2013). Modeling Competing Risks in Nephrology Research: An Example in Peritoneal Dialysis. BMC Nephrology 2013, 14:110 doi:10.1186/1471-2369-14-110
  • Cotovio, P., Rocha, A., Carvalho, M.J., Teixeira, L., Mendonça, D., Cabrita, A., & Rodrigues, A. (2013). Better Outcomes of Peritoneal Dialysis in Diabetic Patients in Spite of Risk of Loss of Autonomy for Hemodialysis. Accepted – Peritoneal Dialysis International.
  • Rocha, A., Rodrigues, A., Teixeira, L., Carvalho, M. J., Mendonça, D., & Cabrita, A. (2012). Temporal Trends in Peritonitis Rates, Microbiology and Outcomes: The Major Clinical Complication of Peritoneal Dialysis. Blood Purification, 33(4), 284-291.

Between descriptive analysis and decision making

When faced with a statistical analysis task, we usually begin with a simple description about our data that will allow us to analyze and interpret our variables, with the aim of making a decision on some hypotheses that had been made at the start of the study.

This post deals with some statistical methods that are regarded as exploration techniques, but that go further than simple and usual descriptive. This talk will focus on correspondence analysis (CA) and principal components analysis (PCA), both are central to Multivariate Analysis.

PCA and CA are usually applied in high dimensional datasets with the principal objective being to reduce the dimensions of the data. Although these methods have their own particularities, both conclude to explain latent variables in the problem through observed data.

PCA: is widely used to capture essential patterns of big datasets. In high dimensional data sometimes it is difficult for researchers to extract interesting features, so one way to solve it is to reduce its dimensionality at the expense of losing information. It works by creating new uncorrelated variables Y_i (named as PCA) through linear combinations of original variables X_k (in general correlated).

Y_i = \alpha_1X_1 + \alpha_2X_2 + \ldots + \alpha_jX_j

These PCA collect all information than original variables, and the goal is to select some PCA by preserving as much data variance as possible.

CA: unlike PCA, this methodology is applied in categorical data (without calculate linear combinations) as a procedure to analyze contingency tables. CA allows us to describe the relation between two nominal variables as well as the relation between the levels of themselves in a Cartesian axis.

The extension of correspondence analysis to many categorical variables is called multiple correspondence analysis.

The applications of PCA and CA are wide and varied, in fields such as biology, ecology, social sciences, psychology, image processing … in which the number of variables is big. As we have said before, in that situation the PCA and CA provide us a method to extract latent variables and population intrinsic characteristics that have not been observed, so that we can think in these as a hypothesis generation system.

With ‘ca’ (‘mjca’) and ‘princomp’ packages of R we can apply Simple (Multiple) Correspondence Analysis and Principal Components Analysis into our data. The following figure illustrates a typical graphic of PCA, representing the first two components.


Here I have briefly commented a little aspects of two procedures to describe large datasets. In following posts I will try to do an example using R. Meanwhile, you can try and play with ‘ca’ and ‘princom’ functions.


How to manage longitudinal data with R?

During the time that this blog has been running several posts about longitudinal data has been published. However, we have not talked yet about how we can deal with them with R.

As we have discussed in other posts, longitudinal data gather information from a group of study subjects over time (repeated measures). When the number of measurements is the same for all subjects and these measurements are equidistant along time, it is said that we have a balanced data. Otherwise our data are called unbalanced.


When working with either format in R, the joineR package that allows us to adapt our data. If our data are balanced we can move from one format to another according to the analysis we are interest in, simply by:

# generate a (balanced) data
simul <- data.frame(id = 1:3, Y.1 = rnorm(3), Y.2 = rnorm(3), age = runif(3,0,18))
# move it to an unbalanced format
simul.unbal <- to.unbalanced(simul, id.col = 1, times = c(1,2), Y.col = 2:3, other.col = 4)
# return data to a balanced format
simul.bal <- to.balanced(simul.unbal, id.col = "id", time.col = "time", Y.col = c("Y.1"), other.col = 4)

Once we have our data in an appropriate format, one of the first descriptive analysis to do is to get the empirical longitudinal variogram. The variogram allows us to check if within-subjects observations are related. In order to do that, we need to have the unbalanced data format and we will get it very easily by using the joineR variogram (although expect it to be a bit slow if you have a large dataset).

 As an example, we will load a data set included in the Applied Longitudinal Data Analysis book by Judith D. Singer and John B. Willett and calculate the empirical variogram:

# read in data set (tolerance data from ALDA book)
tolerance <-read.csv("http://www.ats.ucla.edu/stat/r/examples/alda/data/tolerance1_pp.txt")
vgm <- variogram(indv=tolerance$id, time=tolerance$time, Y=tolerance$tolerance)


The package also allows us to make some more plotting functions and analysis of longitudinal and survival data together using random effects joint models. Certainly a very interesting package for those who deal with this type of data or  are interested in start working with them.

Try it and tell us your experience.


Guest blogger: Tales of R

Collaboration between statisticians and researchers

In the last few days I was thinking about how researchers could collaborate efficiently with their experts in statistics. With the increasing complexity in science, interchanging information can be crucial to get the best results. But what happens when a researcher doesn’t give all the information a statistician needs?

When someone asks for help in this field -as a clinician, I need it often-, many biostatisticians ensure that some basic points are met in a research study -prior to the actual analysis-, and I think they should ask for them as soon as possible:

  • Is the research question sensible, and supported by the literature in any way? A good research idea should be accompanied by a short review with pros, cons and all the important references, which could guide you to the most appropriate or most used statistical methods in whatever field. Ask for it, read it. If the study has a flaw at this point, it’s easier to correct them now than later.
  • Is the research question defined and detailed at the end of the review? Does it have a main objective? Do they plan further analyses depending on the results? Do researchers give enough information for sample size calculation? With these points you can avoid getting stuck in the middle of a study for a long time. The scientific method is always a good guide if things go wrong.
  • Data? Ok, they have planned something taking into account their available knowledge. But how about the resources? Can they collect all the data they need? If not, how can the design be adapted? Often, they have to change something in the research question, and start again. But in the worst case it takes much less time than not doing things right from the beginning.

Have they -researchers, clinicians, whoever-, met all three tips above? Then, your chances of helping them to answer the question will increase, and even the time to the answer can decrease substantially.

I hope this has been useful!

You can find my blog “Tales of R…messing around with free code” here.


Sharing statistical analysis and results through web applications

I have to admit I am not completely comfortable with the RStudio IDE yet. However, RStudio projects are of great interest, and the new package Shiny – released last November- is no exception.

Shiny allows you to build web applications in R, so anyone might be able to access your analysis results through an interactive web browser interface. As they state in their website “…your users choose input parameters using friendly controls like sliders, drop-downs, and text fields. Easily incorporate any number of outputs like plots, tables, and summaries..”

The application is structured in two components: a user interface script named ui.R  and a server script – server.R. The first one deals with the input and output format, the second one contains the code to run the analysis.

I made a quick trial following the tutorial and my (very simple) web app was ready in no time. It is based on some of the examples of the SNPassoc vignette, an R package designed to perform genetic association studies. In this application, you can check the summary for a small set of SNPs and get both a plot showing the frequencies and the results of the association study for a given SNP.

By using Shiny, you can run your applications locally or share them with other users so they can run the applications in their own computers. There are several options to distribute your apps, you can check all of them here. In this case, the app can be accessed through the GitHub Git repository. Once the Shiny package is installed (along with the SNPassoc package used in this example) you just have to run the following code:

shiny:: runGist('4e5e618431a59abe692b')

In my opinion this tool has great potential for sharing analysis results through an interactive and friendly interface.  It might replace files containing a lot of graphics and tables while saving time both to the data analyst and the end user. Have you tried it? Do you want to share your apps with us?


Interview with… Guillermo Vinué Visús

Guillermo Vinué Visús completed his degree in Mathematics in 2008, granted by the Universitat de València (Spain). He also holds a Master’s degree in Biostatistics from the same university. After working at the Drug Research Center of the Santa Creu i Sant Pau Hospital in Barcelona (Spain) for a year, he is currently a PhD student in the Department of Statistics and Operations Research at the Universitat de València. His doctoral thesis focuses on Statistics applied to Anthropometry.

1. Why do you like Biostatistics?

I like Biostatistics because it allows me to apply Maths to different real life problems.

 2. Could you give us some insight in your current field of research?

 I am working on  a research project aiming to develop statistical methodologies to deal with anthropometric data, in order to tackle some statistical problems related to Anthropometry and Ergonomics.

3. Which are, in your opinion, the main advantages of being a researcher?

The main advantage is the possibility to learn everyday a bit more. Research is a continuous learning process.

4. Your whole professional experience has been within the public sector and  the University. How do you see the present and future of research in the Spanish public sector?

The current situation of economic difficulties has caused that unfortunately the government budget for scientific research is more and more limited, so I am concerned about both present and future of the public Spanish research.

5. What do you think of the situation of young biostatisticians in Spain?

Neither better nor worse than other young people in the country. Nowadays, I guess the best way to make progress is to move abroad.

6. What would be the 3 main characteristics or skills you would use to describe a good biostatistician?

 Enthusiasm, effort and a little bit of R knowledge.

7. Which do you think are the main qualities of a good mentor?

To be attentive and available when needed.

Selected publications:

  • Ibañez M. V., Vinué G., Alemany S., Simó A., Epifanio I., Domingo J., Ayala G., “Apparel sizing using trimmed PAM and OWA operators”, Expert Systems with Applications 39, 10512-10520, 2012, http://dx.doi.org/10.1016/j.eswa.2012.02.127
  • Epifanio I., Vinué G., Alemany S., “Archetypal analysis: Contributions for estimating boundary cases in multivariate accommodation problem”, Computers & Industrial Engineering 64, 757- 765, 2013, http://dx.doi.org/10.1016/j.cie.2012.12.011

P-splines for longitudinal data


Ipek Guler is an intern at Biostatech, Advice, Training and Innovation in Biostatistics, S.L. She has a MSc in Statistical Techniques from the University of Santiago de Compostela (Spain) and a BSc in Statistics from Gazi University (Turkey).

Contact Ipek

In previous posts about longitudinal data by Urko Agirre and Hèctor Perpiñán, mentions have already been made to the Mixed Models methodology .

Usually in medical and biological studies, the designed experiments include repeated measures which are used to investigate changes during a period of time repeatedly for each subject in the study. Multiple measurements are obtained for each individual at different times. These type of longitudinal data can be analyzed with a mixed effects model (Pinheiro and Bates, 2000) which allows modeling and analysis of between and within individual variation. An example of such data can be growth curves measured over time.

In many practical situations, using traditional parametric regression techniques is not appropriate to model such curves. Misleading conclusions may be reached if the time effect is incorrectly specified.

Durban et al, (2005) presented flexible mixed models for fitting subject specific curves and factor by curve interactions for longitudinal data in which the individual and interaction curves are modeled as penalized-splines (P-splines) and model’s estimation is based on the mixed model representation of P-splines (Currie and Durban, 2002).

This representation is quite interesting because it allows us to use the methodology and the several software available for mixed models (e.g., nlme and lme4 packages in R), and it also comes equipped with an automatic smoothing parameter choice that corresponds to maximum likelihood (ML) and/or restricted maximum likelihood (REML) estimation of variance components. With this representation, the smoothing parameter becomes the ratio between the variance of residuals and the variance of the random effects.

First of all, let`s define a linear mixed model for a longitudinal data.

As a matrix notation;


  • y is the vector of the observed responses ,

\left(\begin{array}{cc}y_{1}\\\vdots\\ y_{n}\end{array}\right)=\left(\begin{array}{cc}y_{11}\\y_{12}\\\vdots\\ y_{1n_{1}}\\ y_{2_{1}}\\\cdot\\\cdot\\\cdot\end{array}\right)

  • \beta is the fixed effects parameters vector,
  •  \varepsilon=\left(\begin{array}{cc}\varepsilon_{1}\\\varepsilon_{2}\\\vdots\\ \varepsilon_{N}\end{array}\right)  is the residual vector,
  •  u=\left[\begin{array}{cc}u_{1}\\u_{2}\\\vdots\\ u_{N}\end{array}\right] denotes unknown individual effects.
  • Z=\left[\begin{array}{ccc}Z_{1}&\ldots&0\\\vdots&\ldots&\vdots\\ 0&\ldots&Z_{N}\end{array}\right]  a known designed matrix linking u to y,
  • X=\left[\begin{array}{cc}X_{1}\\X_{2}\\\vdots\\ X_{N}\end{array}\right]

  u_{i}, i=1,\ldots,N assumed to be N(0,G), independently of each other and of the \varepsilon_{i}.\varepsilon_{i} is distributed as N(0,R_{i}) where R_{i} and G_{i}  are positive definitive covariance matrices.

Consider a flexible regression model,


\varepsilon_{i}\sim N(0,\sigma^2)

Where y_{i} is the response variable of the observations i=1 ,\ldots, N and f(\cdot) is the smooth function of covariate x. We represent this function with a linear combination of d known basis function B_{j}.

And we reformulate it as;


\varepsilon\sim N(0,\sigma^2I)

Depending on the basis that is used for the P-splines, X and Z have the following forms;

  • Truncated polynomials

Z=\left[(x_{i}-\kappa _{k})^p_+\right]
1 \leqslant i \leqslant n
1\leqslant k \leqslant \kappa

  • B-splines



Where U  is the matrix that contains the eigenvectors of the singular value decomposition of the penalty matrix  P=D\prime D and \Sigma   is a diagonal matrix containing the eigenvalues with q null eigen values (Lee, 2010).

Therefore the model becomes;


u\sim N(0,\sigma^2_{u}I_{c-2})

and \varepsilon\sim N(0,\sigma^2I)

Where c is the number of columns of basis B, and the smoothing parameter becomes; =\frac{\sigma^2}{\sigma^2_{u}} (Durban et al, 2005).

With this mixed model representation of P-splines, we can obtain more flexible mixed models which allows us to have a simple implementation of otherwise complicated models. In future posts, we can talk more about these flexible models with individual curves and factor by curve interactions also described in Durban et al (2005).


SPSS, SAS or R…which one to choose?

Performing statistical analyses by hand in the era of new technologies would seem crazy.  Nowadays, there are three main statistical programs for doing statistics: IBM SPSS, SAS  and R, as it can be read in a more extensive post of this site. Sometimes, biostatisticians need to use more than one package to carry out their analyses.  This means that users of these programs have to move from one to another environment, from front-end to back-end, using different wizard and graphical interfaces, wasting in most occasions an important amount of time. Because of that, in this post I would like to address the differences between the aforementioned programs, pointing out the advantages and the most suitable usage of each software from my personal point of view.

For a good choice of a statistical program, one should take into account several issues such as the following:

1. Habit: It refers to how the biostatistician has gotten used to one particular statistical program and does not need to learn something new.

2. Distrust of freeware: it refers to the license type of the package.

3. Availability of statistical procedures:  are the most used statistical methods included in the package?

4. Management of huge data sets: how big data sets are handled.

Focusing on the first point, it is remarkable that programming with these three packages is quite easy. All of them provide a choice to write the syntax code and run it whenever you want. This also can be applied to anyone who might have to send you the code, including managers or colleagues. Besides, IBM SPSS and R offer a friendly interface allowing the alternative of not learning the syntax procedure. These could be the main reasons for which people continue doing data analysis for a long period with the same statistical program, making a better researcher life!

Another important issue is whether the distribution is freeware or not. When starting to work as a biostatistician, one soon notices the high cost of statistical software licence fees. As you might know, IBM SPSS and SAS are commercial distributions and  R is freeware. Although IBM SPSS and SAS are quite expensive, one should think about the benefits that both programs offer in the long term. For example, some researchers do not trust R, and think its use does not ensure getting the correct results. In those cases, trust comes at a price. Regadless of what they think, R has grown rapidly in the past few years, offering competition to other softwares.

The availability of the most used statistical procedures is another important point to be taken into account.  All of them have integrated in their environments the most popular statistical methods (regression analysis, correspondence analysis, etc..) but sometimes, there is a need for implementing more sophisticated procedures.  As IBM SPSS and SAS are paid-for distributions, they do not let the user to program it, they have fixed instructions. On the contrary, R could be understood as a combined and customized software of multiple small packages to solve the problems. Whenever you have a problem, it is most likely that you will find a solution for it in the packages repository.

Finally, as for the management of huge data sets, it is noteworthy that SAS is the winner. SAS has a robust platform for handling big data sets and it does not  need the whole computer memory for the analysis. In this sense, R still has a long way to go even though the performance of R in this area has recently been exponentially increased. I would also say that IBM SPSS does not have enough support.

To sum up, I would like to give a couple of advices. From the point of view of my experience, I would recommend to use IBM SPSS for medical researchers who start doing statistics: it is quite easy and intuitive and there is no need to learn the syntax. But not for a long period of time: learning basic R coding is not difficult and will be sufficient for the requirements of this type of researchers. It is only a small effort but really worthy! Nonetheless, for advanced researchers and biostatisticians, a combination between SAS and R would be perfect – that is what I do!, I would say I am RSAS user… -. Now, you have the choice: IBM SPSS, SAS or…R?


…a scientific crowd

While researching on scale-free networks, I found this book, which happens to include the very interesting article The structure of scientific collaboration networks and that will serve me as a follow-up to my previous post on social networks here.

Collaborative efforts lie in the foundations of the daily work of biostatisticians. As such, the analysis of these relationships –lack of interaction in some cases- appears to me as fascinating.

The article itself deals with the wider community of scientists, and connections are understood in terms of papers´ co-authorships. The study seems to prove the high presence of small world networks in the scientific community. However short the distance between pairs of scientists I wonder, though, how hard it is to cover that path, i.e., are we really willing to interact with colleagues outside our environment? Is the fear to step out of our comfort zone stopping us from pursuing new biostatistical challenges? Interestingly, one of Newman´s findings amongst researchers in the areas of physics, computer science, biology and medicine is that “two scientists are much more likely to have collaborated if they have a third common collaborator than are two scientists chosen at random from the community.”

Interaction patterns analyzed through social networks diagrams like the one shown in Fig 1., can give us a hint on these patterns of collaboration, but can also be a means towards understanding the spread of information and research in the area (ironically, in a similar fashion to the spread of diseases as explained here). sociogram_biostatistics

Fig.1. Biostatistics sociogram (illustration purposes only; R code adapted from here and here)

In my previous post on the topic, I focused on the great Linkedin inmaps. I will be looking this time at Twitter and an example of the huge amount of information and the great opportunities for analysis that the platform provides. R with its package twitteR makes it even easier… After adapting the code from a really useful post (see here), I obtained data relating to twitter users and the number of times they used certain hashtags (see plots in Fig. 2).


Fig.2. Frequency counts for #bio (top left), #statistics (top right), #biostatistics (bottom left), and #epidemiology (bottom right). Twitter account accessed on the 17th of May 2013.

Although not an exhaustive analysis, it is interesting to notice the lower figures for #biostatistics (turquoise) and #statistics (pink), compared to #bio (green) and #epidemiology (blue) for example (please notice the different scales in the y axis for the four plots). It makes me wonder if the activity in the field is not our strongest point and whether it would be a fantastic way to promote our profession. I am certainly convinced of the great benefits a higher presence in the media would have, particularly in making it more attractive for the younger generations.

That was just a little peek of even more exciting analysis to come up in future posts, meanwhile see you on the media!

Do you make any use of the social networks in your work? Any interesting findings? Can´t wait to hear them all!


A computational tool for applying bayesian methods in simple situations

Luis Carlos SilvaLuis Carlos Silva Ayçaguer. Senior researcher of the Escuela Nacional de Salud Pública from La Habana, Cuba; member of the development team of Epidat. Degree in Mathematics from Universidad de La Habana (1976), PhD from Universidad Carolina (Praga, 1982), Doctor of Science from Universidad de Ciencias Médicas (La Habana, 1999), Titular Academic from República de Cuba.
Email: lcsilva@infomed.sld.cu

Web: http://lcsilva.sbhac.net

Soly SantiagoSoly Santiago Pérez. Technical statistician at Dirección Xeral de Innovación e Xestión da Saúde Pública (General Directorate of Public Health, Spain) from 1996 to present; member of the development team of Epidat. Degree in Mathematics from Universidad de Santiago de Compostela (Spain, 1994). Specialist in Statistics and Operational Research.

Email: soly.santiago.perez@sergas.es

In this post, we present a user friendly tool for applying bayesian methods in simple situations. This tool is part of a free distribution software package, Epidat, that is being developed by the Dirección Xeral de Innovación e Xestión da Saúde Pública (Xunta de Galicia, Spain) since the early 90′s. The general purpose of Epidat is to provide an alternative to other statistical packages for performing analysis of data; more specifically, it brings together a broad range of statistical and epidemiological techniques under a common interface. At present, the fourth version of Epidat, developed in Java, is freely available from the web page: http://dxsp.sergas.es; to download the program, registration is required.

As stated above, one of the methods or “modules” included in Epidat 4 is Bayesian analysis, a tool for the application of bayesian techniques to basic problems, like the estimation and comparison of means and proportions. The module provides a simple approach to Bayesian methods, not based on hierarchical models that go beyond the scope of Epidat.

The module of Bayesian analysis is organized into several sub-modules with the following scheme:

  • Bayes’ theorem
  • Odds ratio
  • Proportion
    • One population
      • Estimation of a proportion
      • Assessment of hypotheses
    • Two populations
      • Estimation of effects
      • Assessment of hypotheses
  • Mean
    • One population
      • Estimation of a mean
      • Assessment of hypotheses
    • Two populations
      • Estimation of a difference
      • Assessment of hypotheses
  • Bayesian approach to conventional tests

The first option of the module can be used to apply Bayes’ theorem. The following three options (Odds ratio, Proportion and Mean) are designed to solve basic inferential problems under Bayesian logic: estimation of odds ratios, proportions and means, as well as differences or ratios of these two last parameters. The punctual estimation is accompanied by the corresponding credibility interval. The techniques available in these options also include methods related to hypothesis testing. Finally, the last sub-module allows the evaluation of conventional tests from a Bayesian perspective.

Estimation of a proportion

Some specific options of Bayesian analysis require the use of simulation techniques. Nevertheless, the user does not need to understand the simulation process to use the program and interpret the results correctly. In addition, the module has a user-friendly interface that allows the user to plot the a priori distribution and choose the values of its parameters (see figure above). The output of Bayesian analysis includes both numerical and graphical results, and the graphics can be edited and modified by the user. In addition, the contents of the output window can be saved as a file in several formats: *.epi, *.pdf, *.odf, or *.rtf.

Finally, like all modules of Epidat, Bayesian analysis has a useful help feature which can be accessed on the help menu in pdf format. This facility has been developed with a didactic and critical approach, including statistical basics of the methods, bibliographic references, and examples of the different options.


R2wd package: another tool to create reports

The last post published in FreshBiostats, by Hèctor Perpiñán, was about reports combining the use of R and Latex. In it, he explained how to transport tables and figures from the R output to Latex.

Well then, this week I shall continue talking about another tool to write reports when we do the statistical analysis with R. It allows us to reduce the time spent to draft the reports and be more accurate by minimising human copying errors. I am going to focus my attention on “R2wd” package to write MS-Word documents from R.

Although most statisticians are used to work daily with Latex which results in high quality appearance, in most cases our clients have never heard about it and prefer to work with MS-Word documents, which are easier to handle.

The R2wd package needs either the statconnDCOM server (via the rcom package) or the RDCOMClient to communicate with MS-Word via the COM interface. Once this interface is installed (on Windows OS), we can create a document using all functions available in this package: wdGet, wdTitle, wdSection, wdType, wdSetFont, wdTable (the object must be a data frame or an array), wdPlot, wdApplyTheme, etc.

Not only do these functions allow us to introduce tables or figures from R output, but can also be used to inject text elements into Word report, themes and templates, different type of text (normal, italic, Verbatim,…), etc.

The following code shows a little example on how to use this package:

### An easy example to use "R2wd" package ###

library("R2wd") #of course, you must install it before!

wdGet() #To open a new Word document

#It is possible that we get an error message with wdGet() function.
#In this case we will need to install:

wdTitle("We introduce here our file title")
wdSection("First section begins here")

wdBody("Inserts text in 'Body' style at the current cursor point in Word.")
wdWrite("The function wdBody is similar to 'wdWrite'",paragraph=TRUE)
wdType("Now we use an italic type and a centering text",italic=TRUE,alignment="center")
wdNormal("To return a Normal type")
wdWrite("We can also insert footnotes")
wdInsertFootnote("insert a footnote") #you can insert footnotes in the
#word document at the cursor position.

wdSection("Now we include another section")
wdSubsection("A toy example")

wdBody("Insert tables of results")

age <- c(23,12,45,34,18,41)
gender <- c(1,1,0,1,0,0)
height <- c(172, 150,169,180,188,160)
data <- data.frame(age,gender,height)

wdTable(data,autoformat=2)#three possibilities to autoformat parameter
wdWrite("insert a text in the bullet",paragraph=TRUE)
wdWrite("we finish the description",paragraph=TRUE)
wdBody("To finish this easy example we include a plot in our Word document")

wdPlot(data$age,data$height,col="red",pch=20,height = 10, width =10, pointsize = 25)

wdQuit() #To close the session

As you have seen, besides being able to format our report as we want (although it may seem cumbersome to do from R), R2wd is really useful because with an only function – wdTable() – and one “click” I can copy a large table in my report with no mistakes.

You will find more functions here. Dare to try it! You will save plenty of time!


Some tools to do a report (with Latex)

In another post I have already talked about different topics not directly associated to Statistic. At this time I will write about reports combining the use of R and Latex. Specifically about how to export tables to Latex and insert latex typesetting in R figures.

Obviously our technical work is very important but the final presentation which should be taken care of. Sometimes the problem is to move our results from the statistical software to the text editor so I will talk in this post about my favourite tools to perform this task.

When looking for a text editor to do a report, there are may options to choose between (MS Word, Latex, knitr/Sweave, Markdown, etc.). I pick out Latex versus MS Word given its possibility to easily include math language. The connection between knitr/Sweave and Markdown with R is straightforward (these two software imbibe R code), in future posts I will talk about them.

R tables to Latex

As I commented  I first work with R and then move the results to Latex. One of the simplest and most effective tools is the function xtable() that we can find in the package of the same name, xtable.

  • xtable(), this function creates the code that we write in latex from an R table.

Other important packages in R with extra functionalities to convert object from R to Latex code are:

  • Hmisc: you can use the function latex() to create a tex file.
  • texreg: highly configurable and designed to convert model summary outputs.


Many people will know the graphics of Latex, called TikZ (see page examples), but what is not so known is that this type of graphics allow a wonderful connection between R and Latex. It allows that formulas, numbers, etc. displayed in the format of choice in Latex. To do this we simply have to follow the next steps:

  1. In R: Create the tikz images from R using tikz() function, see this, in package tikzDevice ( install.packages(“tikzDevice”, repos=”http://R-Forge.R-project.org&#8221;) ). This function will create a tex file that can be compiled into PDF, but the only thing you need to do is to include the tex file into your Latex document.
  2. In Latex: Insert in Latex document with function \input{normal.tex} and \input{symbol-regression.tex}. The R figure uses directly your format in Latex. You need include on preambule of Latex \usepackage{tikz}.


symbol-regressionWill you tell us your “tricks” when making reports?


Dealing with strings in R

As I mentioned in previous posts, I often have to work with Next Generation Sequencing data. This implies dealing with several variables that are text data or sequences of characters that might also contain spaces or numbers, e.g. gene names, functional categories or amino acid change annotations. This type of data is called string in programming language.

Finding matches is one of the most common tasks involving strings. In doing so, it is sometimes necessary to format or recode this kind of variables, as well as search for patterns.

Some R functions I have found quite useful when handling this data include the following ones:

  • colsplit ( ) in the reshape package. It allows to split up a column based on a regular expression
  • grepl ( ) for subsetting based on string values that match a given pattern. Here again we use regular expressions to describe the pattern

As you can see by the arguments of these functions, it might be useful when manipulating strings, to get comfortable handling regular expressions. More information on regular expressions to build up patterns can be found at this tutorial and in the regex R Documentation.

Some other useful functions are included in the stringr package. As the title of the package says: “Make it easier to work with strings”:

  • str_detect ( ) detects the presence or absence of a pattern in a string. It is based on the grepl function listed above
  • fixed ( ) this functions looks for matches based on fixed characters, instead of regular expressions

Once again a Hadley Wickham package, along with reshape and plyr. The three of them containing a set of helpful features for handling data frames and lists.

This is just a brief summary of some options availabe in R. Any other tips on string handling?


Interview with…Moisés Gómez Mateu


Moisés Gómez Mateu is a PhD student at the Universitat Politècnica de Catalunya (UPC) where he works as research assistant. 

Contact Moisés

1. Why do you like Biostatistics?

I have been very curious since I was a child; that’s why I like statistics. Moreover, statistics related to biology and medicine allows helping people to improve their quality of life.

 2. Could you give us some insight in your current field of research?

I focus my thesis on survival analysis, especially on the issue of composite endpoints in clinical trials. The main aim is to analyze what is the best primary endpoint to use, extend the statistical theory, and make practical tools available to researchers by means of a library in R, an on-line platform, etc.

3. Did you find it difficult to move from the private sector to the University?

No. In fact, I left my job as a consultant in a marketing research company to study the MSc Statistics at the UPC, and I think it was a very good decision.

 4. Which are, in your opinion, the main advantages of being a researcher?

It is very satisfactory. The results you get or the research you conduct have nothing to do with the private sector. One usually investigates issues that are related to thing you like and to help improving science in general, not only to earn money.

 5. What do you think of the situation of young biostatisticians in Spain?

The reality is that several colleagues and friends are working-studying abroad or looking for opportunities …

6. What would be the 3 main characteristics or skills you would use to describe a good biostatistician?

 Curiosity, Analytical skills and Creativity.

 7. Which do you think are the main qualities of a good mentor?

Expertise, Modesty and Open-mindedness.

Selected publications:

  • Gómez G, Gómez-Mateu M, Dafni U. Informed Choice of Composite Endpoints in Cardiovascular Trials.Submitted.
  • Gómez G, Gómez-Mateu M. The Asymptotic Relative Efficiency and the ratio of sample sizes when testing two different null hypotheses. Submitted.

Hierarchical models: special structures within repeated measures models.

Living creatures tend to organize their lives within structured communities such as families, schools, hospitals, towns or countries. For instance, students of the same age living in a town could be grouped into different classes according to their grade level, family income, school district and other features of interest. Other examples related with health care workers and patients show clear hierarchical data structures as well.
Hierarchical or nested structures (usually known as HLM) are very common throughout many research areas. The starting point of this data pattern was set up in the field of social sciences. Most research studies in this area were focused on educational data, where the main interest was to examine the relationship between inputs such as students, teachers or school resources, and student outcomes (academic achievement, self-concept, career aspirations…). Under this scenario, researchers emphasized that individuals who are drawn from an institution (classroom, school, town, …) will be more homogenous than subjects randomly sampled from a larger population: students belonging to the same classroom have the experience of being part of the same environment (places, district, teachers,…) and experiences. Due to this fact, observations based on these individuals are not fully independent.


As noted, hierarchical models consider that exist dependency among observations within the same study unit. Till last decades, owing to the lack of software development, ordinary least squares regression (OLSR), classical regression, has been used to estimate the aforementioned relationships. On consequence, results obtained from OLSR show too small standard errors leading to a higher probability of rejection of a null hypothesis than if: (1) an appropriate statistical analysis was performed; (2) data included truly independent observations. It is clear that the main issue that researchers must address is the non-independence of the observations.
Hierarchical modeling is similar to OLSR. It can be seen as an extension of the classical regression, where at least two levels are defined in the predictive model. On the base level (also called the individual level, referred as level1), the analysis is similar to OLSR: an outcome variable is defined as function of a linear combination of one or more independent level 1 variables:

Y_{ij} = \beta_{0j}+X_{1}\beta_{1j} +\ldots+\beta_{kj}X_{k}+\epsilon_{ij}

where Y_{ij} is the value of the outcome variable of the ith individual of group j, \beta_{0j} represents the intercept of group j, \beta_{1j} is the slope of variable X_{1} of group j. On subsequent levels, the level 1 slope and intercept become dependent variables being predicted from level 2 variables:

\beta_{0j} = \delta_{00}+W_{1}\delta_{01} +\ldots+\delta_{0k}W_{k}+u_{0j}
\beta_{1j} = \delta_{10}+W_{1}\delta_{11} +\ldots+\delta_{1k}W_{k}+u_{1j}

Though this process, it is possible to model the effects of level 1 variables and level 2 variables on the desired outcome. In the figure of this post, one can observe that there are three main levels: patients (level1) belong to hospitals (level 2) where, at the same time, hospitals are located in certain neighborhoods (level 3).

This kind of modeling is essential to account for individual – and group level variation in estimating group-level regression coefficients. However, in certain cases, the classical and HLM approaches coincide: (1) when there is very little group-level variation and (2) when the number of groups is small and consequently, there is not enough information to accurately estimate group-level variation. In this setting, HLM gain little from classical OLRS.

Now, it is your opportunity. You know where is worth the effort of applying the HLM methods instead of classical regression.


A pinch of OR in Biostatistics

The first thing in common between Operational Research (OR) and Biostatistics, is that both terms are often misunderstood. OR, or the Science of Better, as it is commonly known, has lots to do with optimization but there is much more to it…Some might be surprised to read that many of the tools of this discipline can be -and are actually- applied to problems in Biostatistics.

It all starts with Florence Nightingale  volunteering as a nurse in the Crimean war in 1853…Her pioneering use of statistics for evidence-based medicine, and her operational research vision to reform nursing, led to a decline in the many preventable deaths that occurred throughout the nineteenth century in English military and civilian hospitals.

Since then, integrated approaches are commonplace: disease forecasting models, neural networks in Epidemiology, discrete-event simulation in clinical trials,…


Figure: Genetic algorithm representation using R. Find here the code for the original dynamical plot by FishyOperations.

There is also an increasing interest in computational biology in OR. Examples of application of these techniques vary from linear programming based flux balance analysis models for cell metabolism studies, to sparse network estimation in the area of Genetics.

In R, there are many packages with which to apply these techniques, from genalg for Genetic algorithms (used to recreate figure above) or games on Games theory, to more specific packages in Bioconductor like CellNOptR, survcomp or OLIN. Also, a general Task View on Optimization can be found here.

Finally, a quick mention to the YoungOR 18 Conference that will be held in Exeter (UK), 9-11 April 2013. It will be covering topics of common interest for biostatisticians with streams on Health, Sustainability, etc. Plenty of inspiration for future posts!

Have you ever used any of these techniques?Any particular tips that you want to share?Tell us about it!


Biostatistics in a Health Services Environment

Aurora Baluja is MD in the Department of Anesthesiology, Intensive Care and Pain Management of the Hospital Clinico Universitario of Santiago de Compostela, and PhD candidate at the Forensic Medicine Unit of the University of Santiago de Compostela (Spain).

As a medical doctor involved in patient management, -in operating room and in the ICU-, every day I witness the huge load of data that has to be processed, interpreted, and stored. In this situation, Biostatistics and Bioinformatics support becomes greatly important to analyse trends and patterns, that  ultimately lead to improve patient care.

I do research about risk profiles of mortality in the ICU, and given the amount of data obtained, the task of reporting my results and ensure reproducibility became very time-consuming.

In this post I intend to comment on the main tools that helped me so much to overcome those obstacles:

  •  R: for me, the possibility of learning a statistical computing language to write and recycle scripts, was definitely the way. R is a powerful, open source programming language and software environment, that allows a full variety of analyses and… beautiful graphs.
  • RStudio: to find a good IDE  for R code was also important. Fortunately, Altea recommended me this easy-to-use, visual piece of cake, with integrated options to report code and results… Thanks!
  • R Markdown is a quick, easy method to write scripts and to report clean code and results, with some formatting options, in an HTML page. It is implemented in R-studio by the package “knitr”. Just one click away, you can generate a complete, dynamic report with text and graphics.
  •  LaTeX: the main reason for me to use LaTeX is not writing documents with beautiful symbols and equations, but to somehow link my R console to a document editor, in order to generate tables and reports, just like I do with R Markdown. For this, I need to use R-studio again,  and 4 more ingredients:
  1. A LaTeX distribution: I have installed TeX Live for Linux, and MiKTeX for MS Windows. A variety of templates, included often in distributions, make the first approach quite easy. Beyond the use of Sweave (see below), I became so fond of LaTeX that now it is my favourite text editor for documents, posters and slides!
  2. R Sweave is the link between R-scripts and LaTeX, making possible to write an entire LaTeX document with dynamic results from R commands embedded in it.
  3. Texreg: maybe the reason why I’ll never quit using LaTeX. Its magic begins after you have run several models of your data, and you are trying to see and compare *all* of them at a glance. It generates latex-formatted tables with your models, ready to paste into a LaTeX document. As a tip, I often use them preserving their \{tabular} environments and customising \{table} options to fit my document style.
  4. Xtable: another R package, that allows to print nice tables in LaTeX format.

I encourage those who haven’t used any of these tools, to give them a try… surely they will help you!


Measuring Disease Trends through Social Media




Nichole Knupp is a freelance blogger and marketing professional who writes guest posts and website content for a variety of different blogs and niches that interest her.


People in the field of Medicine, those in Master´s in Public Health programs (see examples here and here) or in other disciplines such as Biostatistics (see previous post on the topic) might wonder how social media can play a valid role in their career.  While social media has taken the world by storm, it has mostly been thought of as a tool for sharing personal information, marketing, or keeping up on the latest gossip.  However, after this past flu season, those in the medical and scientific fields are finding a whole new way to use information that is trending.  Here are just a few of the things that social media is doing to benefit our health and sciences community.

Measuring Disease Trends
One of the terms being used for those watching and tracking the spread of illnesses through social media is “social media analytics” (see for instance, social network analysis).  Quite simply, it is tracking outbreaks of various viruses through mining social media.  This past season, people were following the spread of influenza through a flu hash tag (#flu).  Viruses such as H1N1 and the Swine Flu were able to be monitored on Twitter.  Further, they found when information was extracted efficiently, it was accurate and there was even a possibility of forecasting further trends.  The most pertinent information though, was what was happening in the present.  Different agencies were able to find what area was the hardest hit and who was at risk.

Public Interest and Concerns
The health and sciences community also found social media to be a great platform in finding out how interested the public was in disease outbreaks and trends.  It was also a way to gather concerns so various agencies could address them.  Social media is utilized by people as a way to have their voices heard through an outlet that is part of their daily lives.  While some people are unwilling to sit down and answer a survey, important information can be gathered by what is mentioned on daily Twitter streams.

Disease Outbreaks and Travel
Since social media is a tool that is being used worldwide, it has also been helpful in tracking outbreaks abroad.  US citizens, and others traveling, can easily be informed through public social media announcements.  Currently there are conversations happening on how best to use the information gathered on social media, and what role it will play in informing the public and addressing questions and concerns. (A very interesting article on the topic can be found here).

Social Media Sources
While Twitter has been the main source used for tracking trends, Instagram, Facebook, and Google Trends also play a part.  Some places to look if you’re interested in seeing how it all works is checking out the HealthMap stream on Twitter (@healthmap), do a Google Trends “flu” search, or try a flu hash tag search (#flu) on Twitter.

It’s not hard to imagine the possibilities and many other things we will be able to track in the future, but one thing is already certain— there is a use for social media if you work in the medical sciences, Public Health, Biostatistics,…


Do we need Spatial Statistics? When?

Spatial Statistics, What is it? and Why use it?

The approach taken in this post is to offer an introduction of basic and main concepts of spatial data analysis as well as the importance of its utilization in some areas like epidemiology or ecology among others. But, before to introduce a definition of spatial statistics and some concepts about this field, I consider relevant to mention some situations in which our data need to be seen as ‘spatial data’.

It is possible that people associate spatial statistics with analysis that contain numerous maps. However, it goes beyond creating these, in fact spatial data analysis is subject to internal structure of the observed data. We therefore have to be careful with the questions not directly answered by looking at the data.

We could make a long list of areas where we can apply spatial statistics: epidemiology, agriculture, ecology, environmental science, geology, meteorology, oceanography,… even econometrics. In all of these we could ask questions like the following to recognize if our data have a spatial structure:

  • Does the distribution of cases of a disease form a pattern in space?Could we relate health outcomes to geographic risk factors?
  • Do they influence environmental and geographical factors in the distribution and variability in the occurrence of fishery species?

But, how can we explain what is spatial statistics? I am sure that we could find infinite definitions of spatial analysis or spatial statistics. We can say that spatial statistics is responsible for analyzing the variability of random phenomena that are linked with their geographic locations. It is used to model the dependence of the georeferenced observations (these can be point observations or areal data).

Depending on the type of data and the purpose of the spatial analysis itself, we classify spatial data sets into one of three basic types: geostatistical data, lattice data (or areal data) and point pattern data. The main dissimilarity between these resides in the set of observations. Let us look at this briefly with some examples:

  • Geostatistics: the geostatistical data are represented by a random vector Y(s), where the locations s \in D varies continuously over a fixed observational region (D \subset \Re^{r}). These data are characterized by spatial dependence between the locations, and the main objective in application of geostatistics is to do predictions in unobserved locations from study region. kriging is the best known technique for prediction in geostatistical data. Some examples of this data are: occurrence of species in a region, annual acid rain deposition in a town, etc.


  • Lattice data: the fixed subset (D \subset \Re^{r}) of observations are located in a continuous region (it can have a regular or irregular shape). It is partitioned into a finite number of geographical areas with well-defined boundaries. A characterization of this type of spatial data is that neighbouring areas are usually more similar than distant ones. An example can be the observed locations from agricultural field trials (here the plots are a regular lattice).
  • Point pattern data: this is the last type of spatial data where D \subset \Re^{r} is itself random. The own locations determined phenomena that occurred randomly in one place, thus generating spatial patterns. One example of point pattern data is the locations of a certain species of tree in a forest (here only the locations are thought of as random).

The above explanation have only been a briefly description of three main types of spatial data. There are two basic methodologies to carry out spatial analysis: through classical statistics or by Bayesian approach (using mainly hierarchical Bayesian methods). The latter will be dealt with in more detail in future posts, given their importance as they can be applied in many situations, like spatial statistics (you can see the book ‘Hierarchical Modeling and Analysis for Spatial Data by Banerjee et al.’ for more information about this).


The complex structure of the longitudinal models

Two weeks ago, we started to talk in this blog about longitudinal data with the post by Urko Agirre. This type of data involves complex structure models called longitudinal models.

Longitudinal studies have two important characteristics:

  1. They are multivariate because for each studied individual many temporal measurements from the response variable (and covariates) are collected.

  2. They are multilevel as the variables measured are nested within the subjects under study, therefore resulting in layers.

These characteristics allow us to make inference about the general trend of the population as well as about the specific differences between subjects that can evolve in another way regarding the overall average behavior.

At the beginning of the 20th century this type of data started to be modelled. Different proposals appeared such as ANOVA models (Fisher, 1918), MANOVA models (generalised from ANOVA models to multivariate) or growth curves (Grizzle and Allen, 1969). All these proposals showed improvements in some aspects. However, they left some others unresolved. On the one hand, ANOVA models are univariate and our data are multivariate. On the other hand, MANOVA models are multivariate but assume  independence between intra-subject observations (observations from the same individual are not independent in general). Finally, the last option, growth curves, contemplate intra-subject observations dependence but are too restrictive on the design matrix.

It was not until the early 80s when a proposal that included all the aspects of these complex data appeared. Laird and Ware proposed the application of linear mixed models (LMM) in the paper “Random-Effects Models for Longitudinal Data“.

The basic structure of these LMM of each patient i is:

y_i = X'_i\beta + Z'_ib_{i} + W_i(t_i) + \epsilon_i


  • y_i=(y_{i1},...y_{in_i}) is the vector of measurements of the response variable, made to ith a total of m subjects in times t_i=(t_{i1},...t_{in_i}). n_i is the number of repeated measured of the ith patient.

  • X'_i\beta represents the deterministic model, being X'_i the submatrix design covariates associated with the ith individual, and \beta its associated parameter vector.

  • Z'_ib_i are random effects responsible for capturing the variability between individuals.

  • W_i(t_i) includes intra individual variability, ie the variability between observations of the same subject.

  • Finally, \epsilon_i reflects the variability that is not due to any kind of systematic error that we can determine.

To specify each of these elements it is essential to first do the descriptive graphics presented in this previous post.

We have learned more about repeated measures and in next posts we will talk  more about that because it is only the beginning. To be continued!!!


Mining genomic databases with R

When dealing with genomic data, retrieving information from databases is a required endeavor.  Most of these repositories of genomic and proteomic data have very useful and friendly interfaces or browsers perfectly suitable for particular searches. But if you are working with high dimensional data e.g. NGS data, more efficient tools to mine those databases are required. R offers several packages that make this task straightforward.

NCBI2R is an R package to annotate lists of SNPs, genes and microsatellites. It obtains information from NCBI databases. This package provides quite useful functions for getting a quick glance at a given set of SNPs or genes, retrieving SNPs in a gene or getting genes belonging to a KEGG pathway (among some other features) as shown in the code below.

mysnps <- GetSNPsInGenes(4137)
myset <- GetIDs("KEGG pathway:Alzheimer´s disease")

biomaRt package, part of Bioconductor project, is an R interface to the well-known BioMart data management system that provides access to several data sources including Ensemble, HGNC, InterPro, Reactome and HapMap.

Most of analysis can be performed with just two functions: useMart() to define the dataset, and getBM() to perform the query. It performs onlique queries baed on attributes – values we are interested in retrieving-,  filters – restrictions on the query- and a given set of values of the filter. Once we define the dataset we are interested in, we can check all the filters and attributes available. Let´s say we want to look for those genes associated with an OMIM phenotype.

source( "http://www.bioconductor.org/biocLite.R")
mymart <- useMart("ensembl",dataset = "hsapiens_gene_ensembl")
getBM(attributes = c("entrezgene", "hgnc_symbol", "mim_morbid_description"), filters ="entrezgene", values = myset, mart = mymart)

Org.Hs.eg.db , which is also part of Bioconductor, is an organism specific package that provides annotation for the human genome. It is based on mapping using Entrez Gene identifiers. In a very similar way to biomaRt, it allows to query  the databases by means of the select() function specifying cols – kind of data that can be returned-, keytypes -which of the columns can be used as keys-  and a given key.

select(org.Hs.eg.db, keys= "MAPT", cols = c("SYMBOL", "CHR", "CHRLOC", "UNIPROT"), keytype = "SYMBOL")

These three packages offer in many cases a good alternative to other well-known data mining systems such as UCSC Table Browser data retrieval tool. Besides, commands from these packages are quite simple, so even those not that familiar with R language can take good benefit from them.

Just a quick tip to finish – I have just found out about fread() function (data.table package) for reading large files. It really makes a difference!


From Descriptive to Repeated Measures Data….one small step for studies, one giant leap for (bio)statistics

Traditional epidemiological descriptive studies, also called cross-sectional  studies,  have been characterized for reporting population health, describing the existing distribution of the collected exposure factors, variables, without relating to other hypotheses.  In other words, they should try to give an answer to three basic “W” questions: who, where and when. Most important uses of this kind of research include health planning and hypothesis generation. Nonetheless, the most important pitfall is that researchers might draw causal inferences when developing this type of studies. Temporal associations between the effects and the outcomes of interest might be unclear. Thus, when a researcher wants to verify the causality effect between two variables, a more appropriate design is highly recommended, such as a study with two or more observations per subject collected over the established research period. The latter design corresponds to repeated measurement data structure, more specifically, to a longitudinal data analysis (a common repeated analysis form in which measurements are recorded on individual subjects over time).

As mentioned in the previous paragraph, the main difference between both research study designs, cross-sectional and longitudinal, is that each experimental unit participating in the first one is observed only once, so for each exposure factor one has only one value per subject. In other words,  each row in the dataset is an observation. However, in longitudinal data each subject is observed  more than once.

It is also worth pointing out an increase in the complexity of the statistical approaches when moving from descriptive analysis to repeated data studies. For instance, in the first setting the statistical methods in use are the simplest ones: mean and percentage comparisons by means of classical tests, regression analysis, etc…However, in repeated measures data sets, and specifically in longitudinal data analysis, is required to  use special statistical techniques for valid analysis and inference. Thus, researchers should be aware of three important points to perform a proper statistical model, in this order:  (1) the trend of the temporal component; (2) the variance-covariance structure; (3) the mean structure. More accurately, the overall trend of the evolutive analysis should be guessed first of all. Temporal trends can follow a linear, quadratic, cubic or even a fourth grade polynomial function. Besides, as observations in the same subject are more likely to be correlated, repeated measures analysis must account for this correlation (the within and between-subject effects must be controlled). Among the possible covariance structures, compound symmetry, unstructured  and  first-order autoregressive  are the most used.  As for the mean structure, the potential exposure factors which could be related with the dependent variable should be included in the model.


Longitudinal studies play an important key role, mostly in epidemiology and clinical research. They are used to determine the change in the outcome of measurement or to evaluate the effectiveness of a new treatment in a clinical trial, among other applicable settings. Under these scenarios, due to the complexity of the statistical analyses,  longitudinal studies involve a great deal of effort, but they offer several benefits. The most importants, from my point of view, are the following: (1) The ability to measure change in outcomes and/or exposure at the individual level, so that the researcher has the opportunity to observe individual patterns of change; (2) the temporal order of the exposure factors and the outcomes is measured. Therefore, the timing of the outcome onset can be correlated with the covariates.

Finally, there is no specific statistical package to perform this kind of analyses. Nowadays, most of them include on their recent releases all the procedures to perform, at least, a basic longitudinal analysis. Now, there is no excuse for identifying a repeated/longitudinal analysis from a descriptive one, and developing them without any doubt…. Do you agree??


Tricky bits: Ordinal data

From questionnaire responses studies to plant-flowering stages and drugs effects scoring analyses, researchers and biostatisticians often face situations when some of the variables are not continuous but present fixed ordered categories instead.

The straightforward approach would be to deal with this data taking into account its full nature. However, in practice, when trying to do the analysis, many professionals still take a continuous (interval scale) approach in order to ensure that the most familiar statistical techniques can be employed.

Carifio and Perla claim that this debate on the use and “abuse” of the so-called Likert scales has been going on for over 50 years, and state as one of the main advantages of what they refer to as “intervalist position”,  the easy access to both traditional and more complex techniques based on the former.

Other authors such as Jamieson and Kuzon Jr. et al advocate for tailored procedures to deal with this kind of variables. Their argument being based on the need to reflect the ordered nature of the data and the difficulty of measurement of distances between the different categories within variables in the case of a continuous approach.


Concerns regarding plotting this data have often been raised too. From pie to divided bar charts (see figure above and other examples here and here), it seems difficult to decide which form of visualisation is more appropriate and easily understandable.

Thankfully, the arrival of new graphical applications and user-friendly specialised software like R packages ordinal and MCMCglmm, is helping bringing consensus closer and closer so we can all be soon speaking the same “ordinal language”.

For or against? I look forward to reading your views on this!!

Note: A highly recommended book on the topic is Agresti, A. (2010). Analysis of Ordinal Categorical Data (2nd ed), Wiley.


2nd Biostatnet General Meeting Review

With a marked focus on young researchers in particular and health-related Biostatistics in general, this 2nd Biostatnet General Meeting, celebrated in Santiago de Compostela (Spain) the 25th and 26th of January, has been a fantastic opportunity for the Network´s members to gather together and discuss common topics of concern as well as successful stories.

FreshBiostats bloggers participated actively and now want to make our readers witnesses of this stimulating event.

7 of the 8 Biostatnet´s main researchers

After the welcome and opening session chaired by Carmen Cadarso, focusing on presentations on the past activities of the Network by Emilio Letón, David Conesa, Inmacularada Arostegui, and Jordi Ocaña, a busy program of events was fitted in a day and a half conference-like event:

Young researchers oral communications

Because of the meeting´s high participation, oral communications by young researchers of Biostatnet, were divided into three sections:

  • BIO session

The topics discussed in this first parallel session were the choice of primary end-points by using a web application interface by Moisés Gómez-Mateu, the modeling of a non proportional hazard regression by Mar Rodríguez-Girondo, and the randomization tests implemented in clinical trials by Arkaitz Galbete. The second part of the session continued with two talks on Chronic Kidney Disease but from two different approaches: the first one, from a survival analysis (competing risks analysis) point of view, was presented by Laetitia Teixeira, and the second one, based on longitudinal analysis (Bayesian longitudinal models), was defended by Hèctor Perpiñán. Finally, Mónica López-Ratón presented his work on estimation of generalized symmetry pointS for classification in continuous diagnostic tests. This session was moderated by Carles Serrat.

  • STAT session

A varied arrangement of talks were framed within the STAT session that featured the interesting view of Joan Valls on the experience of the biostatisticians working in the IRBLleida, two applications of Structured Additive Regression (STAR) models by Elisa Duarte and Valeria Mamouridis, a comparative analysis of different models for the prediction of breast cancer risk by Arantzazu Arrospide, an optimal experimental design application presented by Elvira Delgado, and a simulation study on the performance of Beta-Binomial SGoF multitesting method under dependence.

  • NET session

In this third parallel session, topics such as “bio” research as well as others related to design of experiments were covered. Irantzu Barrio started with a talk on development and implementation of a methodology to select optimal cut-points to categorise continuous covariates in prediction models. Also in this session, Mercedes Rodríguez-Hernández presented her work on D-optimal designs for Adair models.

Also covering “bio” topics,  a talk on derivative contrasts in quantile regression was given by Isabel Martínez-Silva. María Álvarez focused afterwards on the application of the method of maximum combination when comparing proportions. The two last communications dealt with the cost-efectiveness study of treatments for fracture prevention in postmenopausal women by Nuria Pérez-Álvarez, and the application of Generalised Additive Mixed Models for the assessment of temporal variability of mussel recruitment, by María P. Pata.

 Congratulations to the happy winners!!

To conclude these three sessions, Moisés Gómez-Mateu and Irantzu Barrio,  the two winners of both ERCIM´12 Biostatnet invited sessions, received their awards (see picture above).

Posters sessions

Two posters sessions were also included within the hectic program of the meeting, covering a wide range of topics varying, for instance, from the analysis of clinical and genetics factors (Aurora Baluja) to a collective blogging experience like ours (find it here)

As a courtesy to the young researchers participating in the meeting, Biostatnet´s main researchers gave each of us  The Cartoon Guide to Statistics, which definitely finds the fun side of Statistics (see the snapshot below for a nice example ;P). We are very grateful for this gift and promise to make good use of it, maybe by trying to convince those that are still skeptical about the enjoyable side of this amazing science!

Image extracted from “The Cartoon Guide to Statistics”


Throughout the meeting a total of 5 sessions of roundtables and colloquiums took place. Both professionals in the field of Biostatistics as well as young researchers participated and offered their views on different topics.

“Biostatisticians in biomedical institutions: a necessity?” was the first of the interventions of the meeting, which was covered by Arantza Urkaregi, Llorenç Bardiella, Vicente Lustres, and Erik Cobo. They attempted to respond the question with their professional experiences. The answer was unanimously positive.

The colloquium “Genomics, Biostatistics and Bioinformatics” was chaired by Malu Calle and featured presentations from Pilar Cacheiro (one of our bloggers), Roger Milne, Javier de las Rivas, and Álex Sánchez. They emphasized the importance of bringing together biostatistics and bioinformatics in the “omics” era, and a vibrant discussion followed regarding the definition of both terms.

The “Young Researchers roundtable” also generated a refreshing discussion about opportunities for young researchers in Biostatistics. Again, two of our bloggers, Altea Lorenzo and Hèctor Perpiñán, were involved in the session along with Núria Pérez and Oliver Valero, with Moisés Gómez as moderator and Isabel Martínez as organiser. The main conclusions reached in this table were the need for the young biostatisticians to claim their important role in institutions, the aspiration to access specialised courses on the field, and the importance of communication, collaboration, and networking.

In the second morning, another very important topic, “Current training in Biostatistics”, was presented by three professors, Carmen Armero, Guadalupe Gómez and José Antonio Roldán, who currently teach in Biostatistics masters and degrees programmes offered by Spanish universities. Some interesting collective projects were outlined and will hopefully be implemented soon.

Plenary talk

We cannot forget the invited talk on “Past and Current Issues in Clinical Trials” by Urania Dafni, director of  the Frontier Science Foundation-Hellas and Biostatistics professor and director of the Laboratory of Biostatistics of the University of Athens´ Nursing School of Health Sciences. An overall view of this hot topic and the importance of the presence of biostatisticians in the whole process of design and development of drugs was given by this reputed professional in the field. This session and the following discussion were moderated by Montserrat Rué.

Closing colloquium

Finally, a session on the future of Biostatnet and the different alternatives for development and improvement was chaired by Jesús López Fidalgo and María Durbán, with the collaboration of experts on international and national research projects funding, Martín Cacheiro and Eva Fabeiro, and Urania Dafni, director of the greek node of the International Network Frontier Sience.

From what was shown, it seems like the year ahead is going to be a very busy and productive one for the Network and its members. All that we have left to say is… We are already looking forward to the 3rd General Meeting!!


Your comments on the Meeting and this review are very welcome,       let´s keep the spirit up!


Interview with…Natàlia Adell Calvet

Natàlia AdellNatàlia Adell is a graduate in Statistics from the Universitat Politècnica de Catalunya. She also has a master´s degree in Statistical and Operations Resarch from the same University. She worked in KantarMedia and in the Statistical Service of the Universitat Autònoma de Barcelona. At present, she works in the Statistical Assessment Unit of the Research Technical Services of the University of Girona.

Contact Natalia

+34 680778844


1. Why do you like Biostatistics?

Because I like applied Statistics and if you can contribute to a good cause such as decreasing the number of illnesses, you will have all the right ingredients for good science.

2. Could you give us some insight in the work you develop at the Statistical Assessment Unit of the UdG´s Research Technical Services?

My main perception is that people need statisticians to help with a part of their research, studies… Statistics is a science that other scientists need and the Statistical Assessment Unit tries to provide it.

3. What were the main difficulties you found when setting up the unit?

The main difficulty was getting started. We had to organise the unit, establish all the procedures, and also let the community know about us. The most important thing I had was the support of all the people around me, who helped every time I needed it (and still do).

4. Is it possible to combine consultancy/advice and research?

Well, in our case, we dedicate ourselves just to the consultancy and giving advice because doing research is not the aim of the Statistical Assessment Unit. But it might be possible to combine both, because some doubts arise from research, and some questions need a research approach so they can be related.

5. What do you think of the situation of young biostatisticians in Spain?

I think  biostatisticians usually work alone, without the support of other statisticians and, in my opinion, it would be interesting to share knowledge with other biostatisticians. So I hope that BioStatNet and FreshBiostats will allow that! :-)

6. What would be the 3 main characteristics or skills you would use to describe a good biostatistician?

Listening, communicating and having a deep knowledge of Statistics. If you have these three characteristics, you can be a good biostatistician.

7. What do you think are the main qualities of a good mentor?

I think the most important skill is to be organised, knowing the steps you need to take to achieve your goal.  Explaining difficult technics in a clear way will also be appreciated.

8. Finally, is there any topic you would like to see covered in the blog?

Sample size could be a theme of interest!

Selected publications:

  • Adell, N., Puig P., Rojas-Olivares, A., Caja, G., Carné, S. and Salama, A.A.K. A bivariate model for retinal image identification. Computers and Electronics in Agriculture. 2012; 87: 108-112. Epub 2012 June.
  • M. A. Rojas-Olivares, G. Caja, S. Carné, A. A. K. Salama, N. Adell, and P.Puig. Determining the optimal age for recording the retinal vascular pattern image of lambs.  Journal of Animal Science. 2012; 90 (3): 1040-6. Epub 2011 Nov 7.
  • Rojas-Olivares M.A., Caja G., Carné S., Salama A.A.K., Adell N., Puig P. Retinal image recognition for verifying the identity of fattening and replacement lambs. Journal of Animal Science. 2011; 89 (8): 2603-13. Epub 2011 Feb 4.
  • Martínez-Vilalta J, López BC, Adell N, Badiella L & Ninyerola M (2008). Twentieth century increase of Scots pine radial growth in NE Spain shows strong climate interactions. Global change biology. 2008; 14, nº 12: 2868-2881.