Mixed Models in Sports and Health

As in many other research areas, mixed models have become widely applied in sports science and related health fields. Within sports science, different branches account for the interests of different stakeholders: general managers, coaches, teams, supporters, scientists and academics. Human performance and medicine (treatment and prevention of injuries) lie behind them all, and these models are a way to account for within-subject variability, time-dependent covariates and nested data.

On the competition side, efforts have been made in the literature to find ways to model player performance. Casals and Martínez (2013) approach this problem in the context of NBA basketball results by considering a balanced design with player as a random intercept, adopting the forward stepwise approach to model selection of Pinheiro and Bates (2000) to determine the additional factors to include in the final model. The great advantage of their proposed models for points and win scores over previous studies is that, using Linear Mixed Models (LMM) and Generalized Linear Mixed Models (GLMM), they account for variation in players' characteristics and can therefore predict potential variation in player performance, which makes them of great help to coaches, managers and other decision makers.
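As a rough illustrative sketch of this kind of model (not the authors' actual specification; the data and variable names below are simulated and hypothetical), a Poisson GLMM for points scored with a player-level random intercept might look like this in R:

```r
library(lme4)

set.seed(7)
# simulated toy data: 30 players, 10 games each (names illustrative only)
nba <- data.frame(
  player  = factor(rep(1:30, each = 10)),
  minutes = runif(300, 10, 40)
)
ability <- rnorm(30, sd = 0.3)  # latent player-specific ability
nba$points <- rpois(300, lambda = exp(0.5 + 0.05 * nba$minutes + ability[nba$player]))

# Poisson GLMM: fixed effect of minutes played, random intercept per player
m_points <- glmer(points ~ minutes + (1 | player), family = poisson, data = nba)
summary(m_points)
```

The random intercept is what captures the variation in players' characteristics, which is precisely what allows player-specific predictions.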

A similar methodology has been followed to predict scoring ability in the English football Premier League by McHale and Szczepański (2014). You may recall the post by Hèctor Perpiñán on calculating result probabilities via simulation. While in Perpiñán's study only teams' results were considered for the calculations, McHale and Szczepański's mixed modelling approach allows for the inclusion of players' ability to create and convert shots as a factor disaggregated from chance. The predictive accuracy of their models shows again the relevance of models that allow the inclusion of players' characteristics (rather than just teams').

Of particular note to our readers is the trend towards the implementation of mixed models in open-source software, exemplified in the aforementioned papers, which use R (R Core Team, 2012) for their modelling, in particular the packages nlme and lme4.

In community interventions for promoting physical activity like the one described in Okely et al. (2011), one research aim has been to determine policies and interventions that encourage exercise during adolescence in order to alleviate risk factors in the long run. This particular school-based randomised controlled trial used mixed models to account for the hierarchical structure of the data (32 schools from four geographical regions). According to the authors, one of the greatest advantages of the methodology is that it "incorporates all available data allowing for the analysis of partial datasets created when a participant drops out of the study or misses a study visit."

Moving on to health applications, injury outcomes (for instance in baseball pitchers and other athletes) can be modelled by deploying similar approaches, for example to determine whether statistically significant differences exist between groups and across days post-injury.

Finally, in veterinary epidemiology mixed modelling has become equally mainstream, as discussed in a recent article by Stryhn and Christensen (2014). As an example of an application, animal risk factors associated with health disorders in sport can also be modelled via these techniques. Visser et al. (2013) have studied the effect in race horses in the Netherlands via a cross-sectional study including a random farm effect. Commercial software was used in these two examples, SAS (PROC MIXED) and GenStat (GLMM), which are again in common use for fitting mixed models.

As stated by Casals and Martínez (2013), phenomena like Moneyball have raised the profile of sports data analysis. For researchers, big data and, more widely, the intrinsically complex structure of the data coming from the aforementioned fields (and, I would add, software availability) have stimulated the application of these types of models, and they seem to be here to stay…

These are some of the examples that we have found, but we are surely missing other very interesting ones, so please let us know… Are you a researcher in the area who would like to tell us about your experience? Have you used this sort of model in this or other areas, and are you willing to share your experience?

Main references:

Casals, M., & Martínez, J.A. (2013). Modelling player performance in basketball through mixed models. Journal of Performance Analysis in Sport, 13(1), 64-82.

McHale, I., & Szczepański, L. (2014). A mixed effects model for identifying goal scoring ability of footballers. Journal of the Royal Statistical Society: Series A (Statistics in Society), 177(2), 397-417. DOI: 10.1111/rssa.12015

Okely, A., Cotton, W., Lubans, D., Morgan, P., Puglisi, L., Miller, J., Wright, J., Batterham, M., Peralta, L., & Perry, J. (2011). A school-based intervention to promote physical activity among adolescent girls: rationale, design, and baseline data from the Girls in Sport group randomised controlled trial. BMC Public Health, 11(1). DOI: 10.1186/1471-2458-11-658

Visser, E., Neijenhuis, F., de Graaf-Roelfsema, E., Wesselink, H., de Boer, J., van Wijhe-Kiezebrink, M., Engel, B., & van Reenen, C. (2013). Risk factors associated with health disorders in sport and leisure horses in the Netherlands. Journal of Animal Science, 92(2), 844-855. DOI: 10.2527/jas.2013-6692



My impressions from the Workshop on the Future of the Statistical Sciences

An exciting workshop, organised as the capstone of the International Year of Statistics (Statistics2013), was held on the 11th and 12th of November at the Royal Statistical Society headquarters in London, under the stimulating title "The Future of the Statistical Sciences". John Pullinger, president of the Royal Statistical Society, refers to Statistics as a quintessentially multidisciplinary science, and the workshop was indeed a good example of that.

Online attendance was welcomed to the event, giving an additional opportunity to listen to highly reputed professionals in the main areas of Statistics. Virtual participants were also allowed to pose their questions to the speakers, an innovation that worked very well as a way of making these sorts of events available to a wider audience that would otherwise be excluded (I am already looking forward to more!).

In case you missed it at the time, luckily the podcast is still available here.

True to its description, the event covered different areas of application, showed the tools of the field for tackling a wide range of challenges, portrayed potentially high-impact examples of research and, in summary, was a great attention-grabbing exercise that will hopefully draw other professionals to the area. A common vision of ubiquity and great relevance came across from the speeches and showed the field as a very attractive world to join. See my simplified summary below in the form of a diagram, certainly reflective of the great Statistics Views motto "Bringing Statistics Together".


Fig 1. Some of the ideas discussed in the workshop

Particularly relevant to this blog and to my own interests were the presentations on environmental Statistics and statistical Bioinformatics, as well as other health-related talks. Important lessons were taken on board from the rest of the speakers too.

The workshop started with a shared presentation on the hot topic "Statistical Bioinformatics" by Peter Bühlmann (ETH Zürich) and Martin Vingron (Free University of Berlin). In a fascinating talk, Bühlmann argued for the need for uncertainty quantification in an increasingly heterogeneous data world (examples of this are also appearing in other fields, e.g. in the study of autism disorders, as Connie Kasari and Susan Murphy mentioned in "SMART Approaches to Combating Autism") and for models to be the basis of the assignation of uncertainties (a topic covered at length by Andrew Gelman in "Living with uncertainty, yet still learning"), while acknowledging that in the big data context, "confirmatory statistical inference is (even more) challenging". Vingron followed by focusing on "Epigenomics as an example", raising open questions to the audience on how to define causality in Biology, where "most processes […] are feedback circular processes", and calling for models that are just complex enough to allow for mechanistic explanations, and for good definitions of null hypotheses.

In addition to the interesting points of the talk, I found its title particularly attractive in that it could be directly linked to a vibrant round table on "Genomics, Biostatistics and Bioinformatics" in the framework of the 2nd Biostatnet General Meeting, in which, as some of you might remember, the definitions of the terms Biostatistics and Bioinformatics were discussed. I wonder if the term "statistical Bioinformatics" would indeed be the solution to that dilemma. As a matter of fact, Bühlmann himself mentioned at the start of his talk other options, such as Statistical Information Sciences.

Michael Newton and David Schwartz from the University of Wisconsin-Madison also focused on the triumph of sequencing in "Millimeter-long DNA Molecules: Genomic Applications and Statistical Issues", breast cancer being one of the mentioned applications. This was followed by an introduction to "Statistics in Cancer Genomics" by Jason Carroll and Rory Stark (Cancer Research UK Cambridge Institute, University of Cambridge), particularly focusing on breast and prostate cancer and the process of targeting protein binding sites. The latter, as a computational biologist, mentioned challenges in the translation of Statistics for the biologists and vice versa, and identified "reproducibility as (the) most critical statistical challenge" in the area. Also on the topic of reproducibility, "Statistics, Big Data and the Reproducibility of Published Research" by Emmanuel Candès (Stanford University) is especially worth watching.

Another area increasingly gaining attention in parallel with technological improvements is Neuroimaging. Thomas Nichols (University of Warwick) led the audience on a trip through classical examples of its application (from Phineas Gage's accident to observational studies of the structural changes in the hippocampi of taxi drivers, and the brain's reaction to politics) up to current exciting times, with great "Opportunities in Functional Data Analysis", which Jane-Ling Wang (University of California) promoted in her talk.

Also in the area of Health Statistics, different approaches to dietary pattern modelling were shown in "The Analysis of Dietary Patterns and its Application to Dietary Surveillance and Epidemiology" by Raymond J. Carroll (Texas A&M University) and Sue Krebs-Smith (NCI). Challenges in the area include finding dietary patterns over the life course (e.g. special waves of data would appear in periods like pregnancy and menarche for women) and incorporating technology into the studies as a new form of data collection (e.g. pictures of food connected to databases).

Again with a focus on the challenges posed by technology, three talks on environmental Statistics outlined an evolution of the field over time. Starting from current times, Marian Scott (University of Glasgow), in "Environment – Present and Future", stated that "natural variability is more and more apparent" because we are now able to visualise it, an idea going back to the heterogeneity claimed by Bühlmann. Despite the great amounts of data available, initially thanks to technological improvements but ultimately due to public demand ("people are wanting what it is happening now"), the future of the field is still subject to the basic questions of how and where to sample. Especially thought-provoking were Scott's words on "the importance of people in the environmental arena" and the need for effective communication: "we have to communicate: socio-economic, hard science…, it all needs to be there because it all matters, it applies to all the world…"

Amongst other aims of this environmental focus, such as living sustainably in both urban and rural environments, climate is a major worry and fascination, which is being targeted through successful collaborations between specialists in its study and statisticians. "Climate: Past to Future" by Gabi Hegerl (University of Edinburgh) and Douglas Nychka (NCAR) covered "Regional Climate – Past, Present and Future" and showed relevant examples of these successful collaborations.

Susan Waldron (University of Glasgow), under the captivating title "A Birefringent Future", highlighted the need to address challenges in the communication of "multiple layers of information" too, Statistics appearing as the solution for doing this effectively, e.g. by creating "accessible visualisation of data" and by incorporating additional environmental variation into controversial green-energy studies such as whether wind turbines create a microclimate. Mark Hansen (Columbia University Journalism School) seconded Waldron's arguments and called for journalists to play an important role in the challenge: "data are changing systems of power in our world and journalists […] have to be able to think both critically as well as creatively with data", so as to "provide much needed perspective". Examples of this role are terms such as "computational Journalism" already being coined and data visualisation tools being put in place (as a fascinating example, the New York Times' Cascade tool builds a very attractive representation of the information flow in social media). David Spiegelhalter (Cambridge University) also dealt with the topic in the great talk "Statistics for an Informed Public", providing relevant examples (further explanation of the red meat example can be found here as well).

Other useful recommendations, from the presentation "Statistics, Risk and Regulation" by Michel Dacorogna (SCOR) and Paul Embrechts (ETH Zürich), were to encourage "elicitation of experts opinions (to reach a) rational consensus" and "to spend more time with regulators".

In a highly connected world immersed in a data revolution, privacy becomes another major issue. Statistical challenges arising from networked data (e.g. modelling historical interaction information) were addressed by Steve Fienberg (Carnegie Mellon University) and Cynthia Dwork (Microsoft Research), who argued that "statisticians of the future will approach data and privacy (and its loss) in a fundamentally algorithmic fashion", in doing so explicitly answering the quote by Sweeney: "Computer science got us into this mess, can computer science get us out of it?". Michael I. Jordan (University of California-Berkeley), in "On the Computation/Statistics Interface and 'Big Data'", also referred to "bring(ing) algorithmic principles more fully into contact with statistical inference".

I would like to make a final mention of one of the questions posed by the audience during the event. When Paul Embrechts was asked about the differences between Econometrics and Statistics, a discussion followed on how cross-fertilisation between fields has happened back and forth. As a direct consequence of this contact, fields rediscover issues from other areas: for instance, Credit Risk Analysis models were mentioned as deriving from Medical Statistics, and back in Peter Bühlmann's talk, links were also found between Behavioural Economics and Genetics. These ideas, from my point of view, capture the essence of Statistics, i.e. its application to and interaction with multiple disciplines as the foundation of its success.

Many other fascinating topics were conveyed in the workshop but, unfortunately, I cannot fit in mentions of all the talks. I am sure you will all agree it was a fantastic event.


FreshBiostats birthday and September-born famous statisticians

On the occasion of the 1st birthday of FreshBiostats, we want to remember some of the great statisticians born in September who have contributed to the "joy of (bio)stats".

• Gerolamo Cardano (Pavia, 24 September 1501 – 21 September 1576): first systematic treatment of probability
• Caspar Neumann (Breslau, 14 September 1648 – 27 January 1715): first mortality rates table
• Johann Peter Süssmilch (Zehlendorf, 3 September 1707 – 22 March 1767): demographic data and socio-economic analysis
• Georges-Louis Leclerc, Comte de Buffon (Montbard, 7 September 1707 – Paris, 16 April 1788): premier example in "geometric probability" and a body of experimental and theoretical work in demography
• Adrien-Marie Legendre (Paris, 18 September 1752 – 10 January 1833): development of the least squares method
• William Playfair (Liff, 22 September 1759 – London, 11 February 1823): considered the founder of graphical methods of statistics (line graph, bar chart, pie chart and circle graph)
• William Stanley Jevons (Liverpool, 1 September 1835 – Hastings, 13 August 1882): statistical atlas – graphical representations of time series
• Anders Nicolai Kiaer (Drammen, 15 September 1838 – Oslo, 16 April 1919): the representative sample
• Charles Edward Spearman (London, 10 September 1863 – 17 September 1945): pioneer of factor analysis and Spearman's rank correlation coefficient
• Anderson Gray McKendrick (Edinburgh, 8 September 1876 – 30 May 1943): several discoveries in stochastic processes and collaborator in the path-breaking work on the deterministic model for the general epidemic
• Maurice Fréchet (Maligny, 2 September 1878 – Paris, 4 June 1973): contributions in econometrics and spatial statistics
• Paul Lévy (15 September 1886 – 15 December 1971): several contributions to probability theory
• Frank Wilcoxon (County Cork, 2 September 1892 – Tallahassee, 18 November 1965): Wilcoxon rank-sum test and Wilcoxon signed-rank test
• Mikhailo Pylypovych Kravchuk (Chovnytsia, 27 September 1892 – Magadan, 9 March 1942): Krawtchouk polynomials, a system of polynomials orthonormal with respect to the binomial distribution
• Harald Cramér (Stockholm, 25 September 1893 – 5 October 1985): important statistical contributions to the distribution of primes and twin primes
• Hilda Geiringer (Vienna, 28 September 1893 – California, 22 March 1973): pioneering work on the mathematical basis of Mendelian genetics, foundational to disciplines such as genomics and bioinformatics
• Harold Hotelling (Fulda, 29 September 1895 – Chapel Hill, 26 December 1973): Hotelling's T-squared distribution and canonical correlation
• David van Dantzig (Rotterdam, 23 September 1900 – Amsterdam, 22 July 1959): focus on probability, emphasising applicability to hypothesis testing
• Maurice Kendall (Kettering, 6 September 1907 – London, 29 March 1983): random number generation and Kendall's tau
• Pao-Lu Hsu (Peking, 1 September 1910 – Peking, 18 December 1970): founder of the newly formed discipline of statistics and probability in China

It is certainly difficult to think of the field without their contributions. They are all a great inspiration to keep on learning and working!

Note: you can find other interesting dates here.

Update: and Significance´s timeline of statistics here.

Any author you consider particularly relevant? Any other suggestions?


Infographics in Biostatistics

Although the history of Infographics according to Wikipedia does not seem to mention Fritz Kahn as one of the pioneers of the technique, I would like to start this post with one of the didactic graphical representations of this German Jewish doctor, who was highly reputed as a popular science writer in his time.

Apart from his fascinating views of the human body as machines and industrial processes, I am particularly attracted by his illustration below, summarising the evolution of life on Earth as a clock in which the history of humans takes no more than a few seconds…

Animal clock

Image extracted from the printed version of the article “Fritz Kahn, un genio olvidado” published in El País, on Sunday 1st of September 2013.

What could be understood by some as a naive simplification of matters requires, in my opinion, a great deal of scientific knowledge, and it is a fantastic effort to communicate the science behind very complex mechanisms.

This and more modern infographic forms of visualisation represent an opportunity for statisticians, and more specifically biostatisticians, to make our field approachable and understandable for the wider public. Areas such as Public Health (see here), cancer research (find examples here and here) and drug development (see here) are already using them, so we should not be ashamed to make these "less rigorous" graphical representations an important tool in our work.

Note: there are plenty of resources available online to design a nice infographic in R. For a quick peek into how to create easy pictograms, check out this entry in Robert Grant's stats blog. Also, the wordcloud R package will help you visualise the main ideas in a text…
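For instance, a minimal sketch with the wordcloud package (the word list and frequencies below are made up purely for illustration):

```r
library(wordcloud)

# toy words and frequencies (illustrative only)
words <- c("biostatistics", "data", "models", "R", "health", "graphics")
freq  <- c(10, 8, 6, 5, 4, 3)

set.seed(1)  # word placement is random, so fix the seed for reproducibility
wordcloud(words, freq, min.freq = 1, colors = "steelblue")
```

In practice the words and frequencies would come from a term-frequency table of a real text rather than being typed in by hand.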

We will soon show a practical example of these representations on this blog, so stay tuned!


Mixed Models in R: lme4, nlme, or both?

The topic of Mixed Models is an old friend of this blog, but today I want to focus on the R code for these models.

Amongst all the packages that deal with linear mixed models in R (see lmm, ASReml, MCMCglmm, glmmADMB,…), lme4 by Bates, Maechler and Bolker, and nlme by Pinheiro and Bates, are probably the most commonly used in the frequentist arena, with their respective main functions lmer and lme.

I am still unsure which one I would choose (if I had to), but I am sharing here a summary of some of their capabilities, in case it is of help:

Model specification

For all of the following examples I will use the balanced dataset Orthodont from the package nlme, which reflects the change in an orthodontic measurement over time for a sample of 27 children (see Fig. 1).


Fig. 1. Spaghetti plots. Distance vs age by gender – dataset Orthodont

For simplicity's sake, I will consider the following initial models with a simple random factor (please see ref. [3] for centering and further analysis):

model_lme<-lme(distance~age+Sex, random=~1|Subject, data=Orthodont)
model_lmer<-lmer(distance~age+Sex+(1|Subject), data=Orthodont)


  • lmer

The results for t-tests and F-tests based on Restricted Maximum Likelihood (REML) can be found by using the following lines of code (you can add REML=FALSE to change this default setting):

summary(model_lmer)
Fixed effects:

                           Estimate       Std. Error    t value

(Intercept)          17.70671       0.83391     21.233

age                      0.66019        0.06161     10.716

SexFemale           -2.32102        0.76139    -3.048

(Please notice that the reference category in Sex can be changed by using relevel).


anova(model_lmer)

Analysis of Variance Table

       Df      Sum Sq     Mean Sq      F value

age  1      235.356    235.356    114.8383

Sex  1        19.034      19.034        9.2875

  • lme

Conditional t-tests and F-tests are used for assessing the significance of terms in the fixed effects in lme.

Both conditional F- and t-test results are based on the REML conditional estimate of the variance. This is the default option, but we can specify method="ML" for Maximum Likelihood estimates.

For the results from the conditional t-test testing the marginal significance of each fixed-effect coefficient with the other fixed effects in the model, we can use:

summary(model_lme)

Fixed effects: distance ~ age + Sex

                                Value      Std.Error    DF         t-value       p-value

(Intercept)       17.706713   0.8339225    80     21.233044      0.0000

age                     0.660185   0.0616059   80     10.716263      0.0000

SexFemale        -2.321023   0.7614168    25     -3.048294      0.0054

For the results from the conditional F-test testing the significance of terms in the fixed-effects model (sequentially by default), we can use:

anova(model_lme)

                         numDF      denDF      F-value    p-value

(Intercept)           1                80      4123.156    <.0001

age                       1               80      114.838     <.0001

Sex                       1               25        9.292       0.0054

(The argument type would also allow us to specify marginal F-tests).

These conditional tests for fixed-effects terms require denominator degrees of freedom, which will be the focus of the next section.

Degrees of freedom

  • lme

The denominator degrees of freedom correspond to "the classical decomposition of degrees of freedom in balanced, multilevel ANOVA designs" [3]. It is worth noting that the values for these degrees of freedom do not match those provided by other software procedures such as SAS PROC MIXED (see discussions on the topic here and here).

In addition to the denominator degrees of freedom mentioned above, conditional F-tests also require numerator degrees of freedom, defined by the term (see the output from the previous section).

  • lmer: a good explanation regarding the reporting of degrees of freedom in lmer is given by the author of the package in this article (page 28).


p-values

  • lme reports p-values (see previous output), whereas
  • lmer does not, a choice that has been justified by Bates.

Random effects

  • lme allows for nested random effects in a very straightforward way (random=~1|a/b, where factor b is nested in a). Crossed random effects, on the other hand, can be dealt with through "a combination of pdBlocked and pdIdent objects" [3].
  • Nested random effects can again be easily modelled in lmer (+(1|a/b)). Crossed random effects are handled in an easier way than in lme (+(1|a)+(1|b)). You can find further explanations in [2].
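As a minimal sketch of the two lmer specifications, using simulated data with hypothetical factors a and b:

```r
library(lme4)

set.seed(42)
# simulated toy data: factors a and b occur in all combinations (crossed design)
d <- data.frame(
  a = factor(rep(1:6, each = 20)),
  b = factor(rep(1:4, times = 30)),
  x = rnorm(120)
)
ra <- rnorm(6, sd = 1)    # random effect of a
rb <- rnorm(4, sd = 0.5)  # random effect of b
d$y <- 1 + 0.5 * d$x + ra[d$a] + rb[d$b] + rnorm(120)

m_nested  <- lmer(y ~ x + (1 | a/b), data = d)          # b nested within a
m_crossed <- lmer(y ~ x + (1 | a) + (1 | b), data = d)  # a and b crossed
```

Which specification is appropriate depends on the design: nesting treats each level of b as unique within each level of a, while crossing assumes the same b levels recur across all levels of a (as in these simulated data).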

Random effects plots

Two different approaches to the plotting of the random effects can be obtained through the following lines of code:

  • lme: e.g. plot(ranef(model_lme))
  • lmer: e.g. dotplot(ranef(model_lmer, condVar=TRUE)), with dotplot from the package lattice


Fig. 2. Random effects plots for model_lme and model_lmer.

Residuals plots

  • lme allows us to plot the residuals in the following ways:
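The plotting calls themselves are not reproduced here; a plausible set, assuming the model_lme fitted earlier on the Orthodont data, would be:

```r
library(nlme)

model_lme <- lme(distance ~ age + Sex, random = ~1 | Subject, data = Orthodont)

plot(model_lme)                                          # standardized residuals vs fitted values
plot(model_lme, resid(., type = "p") ~ fitted(.) | Sex)  # residuals vs fitted, by gender
qqnorm(model_lme, ~resid(.))                             # normal QQ plot of the residuals
```

The first two calls have direct equivalents for a merMod object, while qqnorm has no lme4 method (lattice::qqmath is the usual replacement), which is consistent with the remark that the last line of code does not work with lmer.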

Fig. 3. Residual plots for model_lme.

  • We can also plot the first two graphs in lmer, but the last line of code does not seem to work with this function.

Correlation structure

  • We can easily incorporate correlation structures in lme. The most commonly used temporal correlation structures are corAR1 and corCAR1 (autoregressive and continuous-time autoregressive structures of order 1), and corCompSymm (compound symmetry).

model_lme<-lme(distance~age+factor(Sex), random=~1|Subject, correlation=corAR1(0.6, form=~1|Subject), data=Orthodont)

Correlation Structure: AR(1)

Formula: ~1 | Subject

Parameter estimate(s):



Further structures available can be found in the help for corClasses.

  • There is no argument in lmer for doing this, but guidance on how to incorporate such structures can be found here.


Heteroscedasticity

  • lme allows you to model heteroscedasticity using varFunc objects, but
  • it is not covered by lmer.
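As a minimal sketch (assuming the Orthodont model from before), a varIdent variance function allowing a different residual variance for each gender could be specified as:

```r
library(nlme)

# residual variance allowed to differ between boys and girls (illustrative choice)
model_var <- lme(distance ~ age + Sex, random = ~1 | Subject,
                 weights = varIdent(form = ~1 | Sex), data = Orthodont)

model_var$modelStruct$varStruct  # estimated ratio of the two residual SDs
```

Other varFunc classes (varPower, varExp, varFixed, …) follow the same weights= interface.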

Although other differences could be mentioned, future posts will cover those matters.

The following books have been found extremely useful:

[1] Badiella, L., Sánchez, J.A. “Modelos mixtos con R versión 0.3” ( from the course “Modelos Mixtos utilizando R”)

[2] Bates, D. (2010). lme4: Mixed-effects modelling with R. Springer

[3] Pinheiro, J.C. and Bates, D.M. (2000). Mixed-Effects Models in S and S-Plus. Springer

[4] Zuur, A.F. et al. (2009). Mixed effects models and extensions in ecology with R. Springer

I will be more than happy to read your comments and tips! 🙂


Side effects of open health data

Recent improvements in the technology to record data have coincided with calls for making these data freely available. Health-related studies are a particular case in point.

At the forefront of these changes, reputable publications have taken measures to set transparency standards. Since January, for example, the British Medical Journal "will no longer publish research papers on drug trials unless all the clinical data on what happened to the patients is made available to anyone who wants to see it." (Significance magazine, Volume 9 Issue 6, December 2012)

In a sector that has often been accused of secrecy, GlaxoSmithKline are also engaged in this spirit of openness. They recently announced that they would make available “anonymised patient-level data from their clinical trials” to researchers “with a reasonable scientific question, a protocol, and a commitment from the researchers to publish their results” (ibid).


Fig 1. Death rates per 1000 in Virginia (USA) in 1940, VADeaths R dataset (for illustration only)

However, in the past few weeks two stories seem to challenge this trend towards greater transparency. At the same time as rumours grow in the UK of cuts to the publication of well-being data (The Guardian, 10th of July 2013), controversy has arisen regarding the recently released National Health Service (NHS) vascular surgeons' individual performance records (BBC News, 28th of June 2013).

While the measure has been welcomed by some sectors of the general public, there have been important criticisms from the medical side. Several doctors within the speciality, with perfectly satisfactory records, are refusing to agree to the metric. The argument is that different types and numbers of procedures, coupled with the variability of prognoses, make published indicators such as death rates misleading to patients.

In general, calls have been made for further research into performance indicators that ensure the information provided to end-users is efficacious. As an example, back in 2011, when the first attempts to publicise this kind of information started, Significance magazine (Volume 8 Issue 3, September 2011) reported failure to agree on "which indicators to use" as one of the causes of the lack of success, and also mentioned "discussions with the Royal College of General Practitioners to establish the most meaningful set."

Tensions between opening up areas of genuine interest to the widest audience and ensuring that there are no unintended side effects are a societal challenge in which statisticians can play a vital role: sometimes numbers cannot speak for themselves, and appropriate interpretation might be required to avoid wrong conclusions. This becomes particularly important when dealing with health issues…

Note: in Spain, it would seem that there is still much work to be done in terms of open data… A PricewaterhouseCoopers report (pp. 120-131) highlights the issue as one of the ten hot topics in the Spanish health system for 2013, and welcomes the creation of the website www.datos.gob.es as one of the first steps towards openness in this and other sectors.

What are your thoughts on this issue? Are there any similar measures being taken in your country or organisation?


…a scientific crowd

While researching scale-free networks, I found this book, which happens to include the very interesting article The structure of scientific collaboration networks and which will serve as a follow-up to my previous post on social networks here.

Collaborative efforts lie at the foundations of the daily work of biostatisticians. As such, the analysis of these relationships (and of the lack of interaction in some cases) strikes me as fascinating.

The article itself deals with the wider community of scientists, and connections are understood in terms of paper co-authorships. The study seems to confirm the high presence of small-world networks in the scientific community. However short the distance between pairs of scientists, I wonder how hard it is to cover that path, i.e. are we really willing to interact with colleagues outside our environment? Is the fear of stepping out of our comfort zone stopping us from pursuing new biostatistical challenges? Interestingly, one of Newman's findings amongst researchers in the areas of physics, computer science, biology and medicine is that "two scientists are much more likely to have collaborated if they have a third common collaborator than are two scientists chosen at random from the community."
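As an aside, the small-world property is easy to experiment with in R using the igraph package; the sketch below uses a simulated Watts-Strogatz network rather than real collaboration data:

```r
library(igraph)

set.seed(1)
# ring of 100 nodes, each linked to its 6 nearest neighbours, with 5% of edges rewired
g <- sample_smallworld(dim = 1, size = 100, nei = 3, p = 0.05)

mean_distance(g)  # short average path length between pairs of nodes
transitivity(g)   # high clustering, the other small-world hallmark
```

Short average distances combined with high clustering are exactly the pattern Newman reports for co-authorship networks.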

Interaction patterns analysed through social network diagrams like the one shown in Fig. 1 can give us a hint about these patterns of collaboration, but can also be a means towards understanding the spread of information and research in the area (ironically, in a similar fashion to the spread of diseases, as explained here).

Fig.1. Biostatistics sociogram (illustration purposes only; R code adapted from here and here)

In my previous post on the topic, I focused on the great LinkedIn InMaps. This time I will be looking at Twitter, an example of the huge amount of information and the great opportunities for analysis that the platform provides. R, with its twitteR package, makes it even easier… After adapting the code from a really useful post (see here), I obtained data on Twitter users and the number of times they used certain hashtags (see plots in Fig. 2).


Fig.2. Frequency counts for #bio (top left), #statistics (top right), #biostatistics (bottom left), and #epidemiology (bottom right). Twitter account accessed on the 17th of May 2013.

Although this is not an exhaustive analysis, it is interesting to notice the lower figures for #biostatistics (turquoise) and #statistics (pink) compared with, for example, #bio (green) and #epidemiology (blue) (note the different y-axis scales in the four plots). It makes me wonder whether activity on these platforms is not our strongest point, and whether a higher profile there would be a fantastic way to promote our profession. I am certainly convinced of the great benefits a stronger presence in the media would have, particularly in making the field more attractive to younger generations.

That was just a little peek at the even more exciting analyses to come in future posts; meanwhile, see you on the media!

Do you make any use of social networks in your work? Any interesting findings? Can't wait to hear them all!

A pinch of OR in Biostatistics

The first thing Operational Research (OR) and Biostatistics have in common is that both terms are often misunderstood. OR, or the Science of Better, as it is commonly known, has a lot to do with optimization, but there is much more to it… Some might be surprised to read that many of the tools of this discipline can be, and actually are, applied to problems in Biostatistics.

It all starts with Florence Nightingale volunteering as a nurse in the Crimean War in 1853… Her pioneering use of statistics for evidence-based medicine, and her operational research vision to reform nursing, led to a decline in the many preventable deaths that occurred throughout the nineteenth century in English military and civilian hospitals.

Since then, integrated approaches have become commonplace: disease forecasting models, neural networks in epidemiology, discrete-event simulation in clinical trials…
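To make the forecasting idea concrete, here is a minimal sketch of one of the simplest disease forecasting tools, a discrete-time SIR (susceptible-infectious-recovered) model. The parameter values are invented for illustration, not calibrated to any real outbreak:

```r
# Discrete-time SIR epidemic model: a minimal disease-forecasting sketch.
# beta = transmission rate per step, gamma = recovery rate per step.
# Parameters are illustrative, not calibrated to any real outbreak.
sir <- function(S0, I0, beta, gamma, steps) {
  N <- S0 + I0
  S <- numeric(steps + 1); I <- numeric(steps + 1); R <- numeric(steps + 1)
  S[1] <- S0; I[1] <- I0; R[1] <- 0
  for (t in seq_len(steps)) {
    new_inf <- beta * S[t] * I[t] / N   # new infections this step
    new_rec <- gamma * I[t]             # new recoveries this step
    S[t + 1] <- S[t] - new_inf
    I[t + 1] <- I[t] + new_inf - new_rec
    R[t + 1] <- R[t] + new_rec
  }
  data.frame(t = 0:steps, S = S, I = I, R = R)
}

out <- sir(S0 = 999, I0 = 1, beta = 0.4, gamma = 0.1, steps = 100)
out$I[which.max(out$I)]  # peak number of infectious individuals
```

Plotting out$I against out$t gives the familiar epidemic curve; richer stochastic and agent-based variants build on exactly this skeleton.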


Figure: Genetic algorithm representation using R. Find here the code for the original dynamic plot by FishyOperations.

There is also an increasing interest within OR in computational biology. Applications of these techniques range from linear programming based flux balance analysis models for cell metabolism studies to sparse network estimation in the area of genetics.

In R, there are many packages with which to apply these techniques, from genalg for genetic algorithms (used to recreate the figure above) and games for game theory, to more specific Bioconductor packages like CellNOptR, survcomp and OLIN. A general Task View on Optimization can also be found here.
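For readers who have never seen one, the core loop of a genetic algorithm (selection, crossover, mutation) fits in a few lines of base R. This is a bare-bones sketch maximizing a toy function, not a substitute for genalg's full implementation; population size, mutation rate and the fitness function are all invented for illustration:

```r
# A bare-bones real-valued genetic algorithm in base R, maximizing a toy
# function with known optimum at x = (3, 3).
set.seed(42)
fitness <- function(x) -sum((x - 3)^2)

pop_size <- 40; n_genes <- 2; generations <- 100
pop <- matrix(runif(pop_size * n_genes, 0, 10), pop_size, n_genes)

for (g in seq_len(generations)) {
  fit <- apply(pop, 1, fitness)
  # Selection: keep the better half of the population
  parents <- pop[order(fit, decreasing = TRUE)[1:(pop_size / 2)], , drop = FALSE]
  # Crossover: each child is the average of two random parents
  kids <- t(replicate(pop_size / 2, {
    p <- parents[sample(nrow(parents), 2), , drop = FALSE]
    colMeans(p)
  }))
  # Mutation: small Gaussian noise on the offspring
  kids <- kids + matrix(rnorm(length(kids), sd = 0.1), nrow(kids))
  pop <- rbind(parents, kids)
}

best <- pop[which.max(apply(pop, 1, fitness)), ]
best  # should be close to c(3, 3)
```

Because the top half of each generation survives unchanged, the best fitness never decreases; genalg's rbga() wraps the same logic with more sophisticated operators.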

Finally, a quick mention of the YoungOR 18 Conference that will be held in Exeter (UK), 9-11 April 2013. It will cover topics of common interest for biostatisticians, with streams on Health, Sustainability, etc. Plenty of inspiration for future posts!

Have you ever used any of these techniques? Any particular tips that you want to share? Tell us about it!

Tricky bits: Ordinal data

From questionnaire-response studies to plant-flowering stages and drug-effect scoring analyses, researchers and biostatisticians often face situations in which some of the variables are not continuous but instead present fixed, ordered categories.

The straightforward approach would be to deal with these data taking their full nature into account. However, in practice, many professionals still take a continuous (interval-scale) approach when doing the analysis, in order to ensure that the most familiar statistical techniques can be employed.

Carifio and Perla claim that this debate on the use and “abuse” of the so-called Likert scales has been going on for over 50 years, and state, as one of the main advantages of what they refer to as the “intervalist position”, the easy access to both traditional and more complex techniques based on it.

Other authors, such as Jamieson and Kuzon Jr. et al., advocate tailored procedures for this kind of variable. Their argument is based on the need to reflect the ordered nature of the data, and on the difficulty of measuring distances between the different categories of a variable under a continuous approach.


Concerns regarding how to plot these data have often been raised too. From pie to divided bar charts (see figure above and other examples here and here), it seems difficult to decide which form of visualisation is more appropriate and more easily understood.

Thankfully, the arrival of new graphical applications and of user-friendly specialised software, like the R packages ordinal and MCMCglmm, is helping to bring consensus closer and closer, so we can all soon be speaking the same “ordinal language”.
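As a flavour of the tailored approach, here is a minimal proportional-odds (cumulative logit) fit. I use MASS::polr, which ships with the standard R distribution, on the classic Copenhagen housing-satisfaction data from MASS; the ordinal package's clm() fits the same class of models with more options:

```r
# Proportional-odds (cumulative logit) model with MASS::polr.
# 'housing': resident satisfaction (Low < Medium < High) cross-classified
# by perceived influence, housing type and contact with other residents.
library(MASS)

fit <- polr(Sat ~ Infl + Type + Cont, weights = Freq, data = housing)
summary(fit)

# An ordered response with K levels yields K - 1 estimated cutpoints
length(fit$zeta)  # 2 cutpoints for the 3 satisfaction levels
```

The cutpoints (fit$zeta) are exactly what a continuous, interval-scale treatment of the response would throw away: they let the data decide how far apart the ordered categories really are.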

For or against? I look forward to reading your views on this!!

Note: A highly recommended book on the topic is Agresti, A. (2010). Analysis of Ordinal Categorical Data (2nd ed), Wiley.

B for Biology: Not just counting sheep

As some of my co-bloggers have mentioned before, Biostatistics has lately been closely associated with studies in the health sciences and has somewhat forgotten the wider biological side of things. Today I will be focusing on ecological and environmental matters and the statistical approach to these kinds of problems.

According to Smith (Ecological Statistics; Encyclopedia of Environmetrics, Vol. 2, pp. 589-602; John Wiley & Sons, 2002), ecological statistics can be defined as the area of Statistics that focuses on ecological problem solving, where ecology can be understood as the scientific study of the distribution and abundance of organisms. It therefore covers “sampling, assessment, and decision making for both policy and research” (Patil, G.P., Environmental and Ecological Statistics; Encyclopedia of Environmetrics, Vol. 2, pp. 672-674; John Wiley & Sons, 2002), and requires advanced techniques to ensure the correct modelling of complex univariate and multivariate relationships (often nonlinear) from both spatial and temporal perspectives.


To fully understand this field of study, we initially need to make a clear distinction between single-species and multispecies analysis, two diametrically opposed approaches calling for different statistical strategies.

The former is mainly based on measurements of a species' abundance and performance (survival, growth, and recruitment). As such, it encounters an old dilemma: how can observational bias be kept to a minimum? Petersen (mark-recapture) and transect methods are used to conduct wildlife censuses while avoiding this and other biases, and advanced methodologies such as mixed models, flexible regression techniques, spatial and temporal statistics, and Bayesian inference are applied in the analysis itself.
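The Petersen method itself reduces to a one-line estimator: mark M animals, later catch C animals of which R are recaptures, and estimate abundance as MC/R. A quick sketch with invented numbers, including Chapman's bias-corrected variant:

```r
# Lincoln-Petersen mark-recapture estimate of population size, with
# Chapman's bias-corrected variant. Numbers are invented for illustration.
petersen <- function(M, C, R) {
  # M: animals marked in the first sample
  # C: animals caught in the second sample
  # R: marked animals among the second sample (recaptures)
  M * C / R
}

chapman <- function(M, C, R) {
  (M + 1) * (C + 1) / (R + 1) - 1  # less biased when samples are small
}

petersen(M = 50, C = 60, R = 12)  # 250
chapman(M = 50, C = 60, R = 12)   # approx. 238.3
```

Both rest on strong assumptions (closed population, equal catchability, no mark loss); when those fail, the more advanced methodology mentioned above takes over.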

Multispecies analysis, on the other hand, deals with the complicated interactions and dependencies existing in the various ecosystems. Its main pillars are the notions of diversity (measuring global changes in different species as a community, and mostly criticised for the potential lack of ecological relevance of some of the measures) and integrity (metrics accounting for an ecosystem's unimpaired state; read this interesting article for further discussion of the difference between health and integrity). Multivariate analysis of ecosystems includes methods like correspondence analysis and redundancy analysis, amongst many others.
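The best-known of those diversity measures is the Shannon index, H' = -Σ p_i log(p_i) over species proportions p_i. A base-R sketch on made-up counts (the vegan package's diversity() does the same and much more):

```r
# Shannon diversity index H' = -sum(p_i * log(p_i)) on species counts,
# plus Pielou's evenness J = H' / log(S). Counts are made up.
shannon <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]               # 0 * log(0) is taken as 0 by convention
  -sum(p * log(p))
}

counts <- c(grass = 40, clover = 30, daisy = 20, thistle = 10)
H <- shannon(counts)
J <- H / log(length(counts))  # evenness in [0, 1]
round(c(H = H, J = J), 3)     # H = 1.280, J = 0.923
```

The criticism mentioned above applies here too: very different communities can share the same H', which is why diversity indices are only one pillar among several.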

It is worth noticing that the latter is nowadays a major focus of research, as a direct consequence of an increasing public awareness of the need to preserve endangered ecosystems in order to ensure the whole planet's good health.

In the end, it is not just about counting sheep but about how to count them: ensuring representativeness, and considering issues like diversity and integrity in their relationships with other species.

Note: as proof of the importance of this broad area in its own right, there is a multidisciplinary journal dedicated to the topic, “Environmental and Ecological Statistics”, and an exhaustive R Task View called “Analysis of Ecological and Environmental Data”, available here. “Analyzing Ecological Data” (2007) by Zuur, Ieno & Smith is also highly recommended.

Have you faced any of these problems? Any tips? Many thanks for your comments!!