Featured

# Mixed Models in R: lme4, nlme, or both?

The topic of Mixed Models is an old-friend of this blog, but I want to focus today on the R code for these models.

Amongst all the packages that deal with linear mixed models in R (see lmm, ASReml, MCMCglmm, glmmADMB,…), lme4 by  Bates, Maechler and Bolker, and nlme by Pinheiro and Bates are probably the most commonly used -in the frequentist arena-, with their respective main functions lmer and lme.

I am still unsure as to which one I would choose – if I had to -, but I am sharing here a summary of some of their capabilities, in case it can be of help:

Model specification

I will be using for all of the following examples the balanced dataset Orthodont from the package nlme, which reflects the change in an orthodontic measurement over time for a sample of 27 children (see Fig. 1).

Fig. 1. Spaghetti plots. Distance vs age by gender – dataset Orthodont

For simplicity´s sake, I will consider the following initial models with a simple random factor (please see ref. [3] for centering and further analysis):

model_lmer<-lmer(distance~age+Sex+(1|Subject),data=Orthodont)
model_lme<-lme(distance~age+Sex, random=~1|Subject,data=Orthodont)


Tests

• lmer

The results for t-tests and F-tests based on Restricted Maximum Lilkelihood (REML) can be found by using the following lines of code (you can add REML=FALSE to change this default setting):

summary(model_lmer)


Fixed effects:

Estimate       Std. Error    t value

(Intercept)          17.70671       0.83391     21.233

age                      0.66019        0.06161     10.716

SexFemale           -2.32102        0.76139    -3.048

(Please notice that the reference category in Sex can be changed by using relevel).

anova(model_lmer)


Analysis of Variance Table

Df      Sum Sq     Mean Sq      F value

age  1      235.356    235.356    114.8383

Sex  1        19.034      19.034        9.2875

• lme

Conditional t-tests and F-tests are used for assessing the significance of terms in the fixed effects in lme.

Both F and t conditional tests results are based on REML conditional estimate of the variance. This option will be the default, but we can specify method=”ML”   for Maximum Likelihood estimates.

For the results from the conditional t-test testing the marginal significance of each fixed effect coefficient with the other fixed effects in the model, we can use:

summary(model_lme)


Fixed effects: distance ~ age + Sex

Value      Std.Error    DF         t-value       p-value

(Intercept)       17.706713   0.8339225    80     21.233044      0.0000

age                     0.660185   0.0616059   80     10.716263      0.0000

SexFemale        -2.321023   0.7614168    25     -3.048294      0.0054

For the results from the conditional F-test testing the significance of terms in the fixed effects model (sequentially by default), we can use:

anova(model_lme)


numDF      denDF      F-value    p-value

(Intercept)           1                80      4123.156    <.0001

age                       1               80      114.838     <.0001

Sex                       1               25        9.292       0.0054

(The argument type would also allow us to specify marginal F-tests).

These conditional tests for fixed-effects terms require denominator degrees of freedom, which will be the focus of the next section.

Degrees of freedom

• lme

The denominator degrees of freedom correspond to “the classical decomposition of degrees of freedom in balanced, multilevel ANOVA designs” [3]. It is worth noticing that the values for these degrees of freedom do not match those provided by other software procedures such as SAS PROC MIXED (see discussions on the topic here and here).

Additionally to the denominator degrees of freedom aforementioned, conditional F-tests also require numerator degrees of freedom defined by the term (see output from previous section).

• A good explanation regarding the reporting of degrees of freedom in lmer is given by the author of the package in this article (page 28).

p-values

• lme reports p-values (see previous output), whereas
• lmer doesn’t but this has been justified by Bates.

Random effects

• lme allows for nested random effects in a very straightforward way (random=~1|a/b, where factor b is nested in a). Crossed random effects on the other hand, can be dealt with through “a combination of pdBlocked and pdldent objects” [3].
• Nested random effects can again be easily modelled in lmer (+(1|a/b)). Crossed random effects are handled in an easier way than in lme (+(1|a)+(1|b)). You can find further explanations in [2].

Random effects plots

Two different approaches to the plotting of the random effects can be obtained through the following lines of code:

• lme
plot(ranef(model_lme))

• lmer
qqmath(ranef(model_lmer))


Fig. 2. Random effects plots for model_lme and model_lmer.

Residuals plots

• lme allows to plot the residuals in the following ways:
res_lme=residuals(model_lme)
plot(res_lme)
qqnorm(res_lme)
qqline(res_lme)
plot(model_lme)


Fig. 3. Residual plots for model_lme.

• We can also plot the first two graphs in lmer, but the last line of code does not seem to work with this function.

Correlation structure

• We can easily incorporate correlation structures in lme.  Mostly used for temporal correlation structures are corAR1, corCAR1 (autoregressive and continuous autoregressive correlation structures of order 1), and corCompSymm (compound symmetry).
model_lme<-lme(distance ~ age + factor(Sex),random = ~ 1 | Subject, cor=corAR1(0.6,form=~1|Subject),data = Orthodont)


Correlation Structure: AR(1)

Formula: ~1 | Subject

Parameter estimate(s):

Phi

0.05849311

Further structures available can be found in the help for corClasses.

• There is not an argument in lmer for doing this, but guidance on how to incorporate the structures can be found here.

Heteroscedasticity

• lme allows you to model heteroscedasticity using the varFunc object,  but
•  it is not covered by lmer.

Although some other differences could be mentioned, future posts will cover those other matters.

The following books have been found extremely useful:

[1] Badiella, L., Sánchez, J.A. “Modelos mixtos con R versión 0.3” ( from the course “Modelos Mixtos utilizando R”)

[2] Bates, D. (2010). lme4: Mixed-effects modelling with R. Springer

[3] Pinheiro, J.C. and Bates, D.M. (2000). Mixed-Effects Models in S and S-Plus. Springer

[4] Zuur, A.F. et al. (2009). Mixed effects models and extensions in ecology with R. Springer

Featured

# Side effects of open health data

Recent improvements in the technology to record data, have coincided with calls for making this data freely available. Health related studies are a particular case in point.

At the forefront of these changes, reputable publications have taken measures to set transparency standards.  Since January for example, the British Medical Journal “will no longer publish research papers on drug trials unless all the clinical data on what happened to the patients is made available to anyone who wants to see it.” (Significance magazine Volume 9 Issue 6, December 2012)

In a sector that has often been accused of secrecy, GlaxoSmithKline are also engaged in this spirit of openness. They recently announced that they would make available “anonymised patient-level data from their clinical trials” to researchers “with a reasonable scientific question, a protocol, and a commitment from the researchers to publish their results” (ibid).

Fig 1. Death rates per 1000 in Virginia (USA) in 1940, VADeaths R dataset (for illustration only)

However, in the past few weeks two stories seem to challenge this trend towards greater transparency. At the same time as rumours grow in the UK of cuts in the publication of well-being data (The Guardian, 10th of July 2013), controversy has arisen regarding the recently released National Health System (NHS) vascular surgeons individual performance records (BBC News 28th of June 2013) .

While the measure has been welcomed by some sectors of the general public, there have been important criticisms coming from the medical side. Several  doctors within the speciality, with perfectly satisfactory records, are refusing to agree to the metric. The argument is that different types and number of procedures coupled with the variability of prognoses make published indicators such as death rates misleading to the patients.

In general, calls have been made for further research into performance indicators that ensure information provided to the end-users is efficacious. As an example of this, back in 2011 when the first attempts to publicise this kind of information started, Significance magazine (Volume 8 Issue 3, September 2011) reported as one of the causes for the lack of success, failure to agree on “which indicators to use”, and also mentioned “discussions with the Royal College of General Practitioners to establish the most meaningful set.”

Tensions between opening up areas of genuine interest to the widest audience and ensuring that there are not unintended side effects, are a societal challenge in which statisticians can play a vital role: sometimes numbers cannot speak for themselves, and appropriate interpretations might be required to avoid wrong conclusions. This becomes particularly important when dealing with health issues…

Note: in Spain, it would seem that there is still much work to be done in terms of open data…. A PricewaterhouseCoopers report (pp. 120 -131) highlights the issue as one of the ten hot topics in the Spanish Health System for 2013, and welcomes the creation of the website www.datos.gob.es as one of the first steps towards openness in this and other sectors.

What are your thoughts on this issue? Are there any similar measures being taken in your country or organisation?

Featured

# Interview with…Laetitia Teixeira

Laetitia is a graduate in Applied Mathematics and she also has a Master’s degree in Applied Statistics and Modelling from the University of Porto, Portugal. At present, she is a PhD student on Applied Mathematics and she works in the Research Unit UNIFAI (Institute of Biomedical Sciences Abel Salazar, University of Porto, Portugal)

1. Why do you like Biostatistics?

Biostatistics allows for the application of statistical theory to various areas of research practice and to work in various areas. Statistics and medicine are two areas of great interest for me and biostatistics allows me to work in both.

2. Could you give us some insight in your current field of research?

My PhD work focuses on survival analysis in the presence of competing risks. All the practical work is based on end-stage renal disease patients with peritoneal dialysis as renal function replacement therapy. We explore several statistical approaches, such as regression models taking competing risks into account, multistate models and joint models for longitudinal and competing risks data. Using these approaches, we can give more and better information about the disease progression, helping clinicians in the evaluation of patients and treatment planning.

Combined with my PhD, I am a fellow researcher at UNIFAI/ICBAS-UP, a Research Unit specialized in ageing and health.

3. Which are, in your opinion, the main advantages of being a researcher?

The opportunity to work in several areas with multidisciplinary teams.

4. What do you think of the situation of young biostatisticians in Portugal?

In Portugal, biostatisticians are mostly present in higher education institutions. Some public and private enterprises have been integrating some young biostatisticians, however in a very limited number. Some colleagues have gone to other European countries, where they have found better opportunities in this area.

5. What would be the 3 main characteristics or skills you would use to describe a good biostatistician?

Interested in research, versatile and good communicator.

6. Which do you think are the main qualities of a good mentor?

Motivator, interested in research and dedicated, good communicator.

7. Finally, is there any topic you would like to see covered in the blog?

A list of working groups organized by research themes. This list would be important for young biostatisticians to find people according to working area and would allow students/researchers to create other networks.

Selected publications:

• Teixeira, L., Rodrigues, A., Carvalho, M.J., Cabrita, A. & Mendonça, D. (2013). Modeling Competing Risks in Nephrology Research: An Example in Peritoneal Dialysis. BMC Nephrology 2013, 14:110 doi:10.1186/1471-2369-14-110
• Cotovio, P., Rocha, A., Carvalho, M.J., Teixeira, L., Mendonça, D., Cabrita, A., & Rodrigues, A. (2013). Better Outcomes of Peritoneal Dialysis in Diabetic Patients in Spite of Risk of Loss of Autonomy for Hemodialysis. Accepted – Peritoneal Dialysis International.
• Rocha, A., Rodrigues, A., Teixeira, L., Carvalho, M. J., Mendonça, D., & Cabrita, A. (2012). Temporal Trends in Peritonitis Rates, Microbiology and Outcomes: The Major Clinical Complication of Peritoneal Dialysis. Blood Purification, 33(4), 284-291.
Featured

# Between descriptive analysis and decision making

When faced with a statistical analysis task, we usually begin with a simple description about our data that will allow us to analyze and interpret our variables, with the aim of making a decision on some hypotheses that had been made at the start of the study.

This post deals with some statistical methods that are regarded as exploration techniques, but that go further than simple and usual descriptive. This talk will focus on correspondence analysis (CA) and principal components analysis (PCA), both are central to Multivariate Analysis.

PCA and CA are usually applied in high dimensional datasets with the principal objective being to reduce the dimensions of the data. Although these methods have their own particularities, both conclude to explain latent variables in the problem through observed data.

PCA: is widely used to capture essential patterns of big datasets. In high dimensional data sometimes it is difficult for researchers to extract interesting features, so one way to solve it is to reduce its dimensionality at the expense of losing information. It works by creating new uncorrelated variables $Y_i$ (named as PCA) through linear combinations of original variables $X_k$ (in general correlated).

$Y_i = \alpha_1X_1 + \alpha_2X_2 + \ldots + \alpha_jX_j$

These PCA collect all information than original variables, and the goal is to select some PCA by preserving as much data variance as possible.

CA: unlike PCA, this methodology is applied in categorical data (without calculate linear combinations) as a procedure to analyze contingency tables. CA allows us to describe the relation between two nominal variables as well as the relation between the levels of themselves in a Cartesian axis.

The extension of correspondence analysis to many categorical variables is called multiple correspondence analysis.

The applications of PCA and CA are wide and varied, in fields such as biology, ecology, social sciences, psychology, image processing … in which the number of variables is big. As we have said before, in that situation the PCA and CA provide us a method to extract latent variables and population intrinsic characteristics that have not been observed, so that we can think in these as a hypothesis generation system.

With ‘ca’ (‘mjca’) and ‘princomp’ packages of R we can apply Simple (Multiple) Correspondence Analysis and Principal Components Analysis into our data. The following figure illustrates a typical graphic of PCA, representing the first two components.

Here I have briefly commented a little aspects of two procedures to describe large datasets. In following posts I will try to do an example using R. Meanwhile, you can try and play with ‘ca’ and ‘princom’ functions.

Featured

# How to manage longitudinal data with R?

During the time that this blog has been running several posts about longitudinal data has been published. However, we have not talked yet about how we can deal with them with R.

As we have discussed in other posts, longitudinal data gather information from a group of study subjects over time (repeated measures). When the number of measurements is the same for all subjects and these measurements are equidistant along time, it is said that we have a balanced data. Otherwise our data are called unbalanced.

When working with either format in R, the joineR package that allows us to adapt our data. If our data are balanced we can move from one format to another according to the analysis we are interest in, simply by:

# generate a (balanced) data
simul <- data.frame(id = 1:3, Y.1 = rnorm(3), Y.2 = rnorm(3), age = runif(3,0,18))
# move it to an unbalanced format
simul.unbal <- to.unbalanced(simul, id.col = 1, times = c(1,2), Y.col = 2:3, other.col = 4)
# return data to a balanced format
simul.bal <- to.balanced(simul.unbal, id.col = "id", time.col = "time", Y.col = c("Y.1"), other.col = 4)


Once we have our data in an appropriate format, one of the first descriptive analysis to do is to get the empirical longitudinal variogram. The variogram allows us to check if within-subjects observations are related. In order to do that, we need to have the unbalanced data format and we will get it very easily by using the joineR variogram (although expect it to be a bit slow if you have a large dataset).

As an example, we will load a data set included in the Applied Longitudinal Data Analysis book by Judith D. Singer and John B. Willett and calculate the empirical variogram:

# read in data set (tolerance data from ALDA book)
vgm <- variogram(indv=tolerance$id, time=tolerance$time, Y=tolerance\$tolerance)