Working with Joint Models in R

Following previous posts on longitudinal models (see posts) and interviews in which mention has been made of survival analysis (read one of these interviews), today we will focus on models that combine both methodologies: the so-called joint models.

Often a problem admits more than one approach. For example, when HIV data are analyzed, we can on the one hand focus on studying the evolution of CD4 cell counts (a longitudinal model) and, on the other, study the time to death (a survival analysis model). Joint modeling allows both models to share information and achieves better results.

From a practical point of view, in R we can use two packages, JM and JMbayes, to fit these models under the frequentist and the Bayesian paradigm, respectively. These packages are not difficult to use because they build on the nlme and survival packages, with which you might already be familiar.

Here I show you how to use them:

    • We load the JM and JMbayes packages
    • We fit the longitudinal and survival models independently, saving the results in objects
# Load the packages (JM itself loads nlme and survival)
library(JM)
library(JMbayes)
# Fit the longitudinal model
lmeFit.aids <- lme(CD4 ~ obstime * drug, random = ~ obstime | patient, data = aids)
# Fit the survival model (x = TRUE keeps the design matrix, which the joint model needs)
survFit.aids <- coxph(Surv(Time, death) ~ drug, data = aids.id, x = TRUE)
    • We pass the fitted objects (the lmeObject and the survObject) to the joint-model functions in the JM and JMbayes packages
# Fit the joint model (Frequentist)
jointFit.aidsFREQ <- jointModel(lmeFit.aids, survFit.aids, timeVar = "obstime")
# Fit the joint model (Bayesian)
jointFit.aidsBAYES <- jointModelBayes(lmeFit.aids, survFit.aids, timeVar = "obstime")

It is possible to specify different characteristics of the joint model. With the JM package, various options for the survival model are available. With JMbayes, it is possible to choose the type of association structure between the longitudinal and survival processes.

Finally, if you are interested in these models and are thinking about using them in your research, a good reference is the book Joint Models for Longitudinal and Time-to-Event Data: With Applications in R by Dimitris Rizopoulos, the author of JM and JMbayes.

Note: the data set used to fit the longitudinal model must have as many rows as observations per patient (aids). The survival model, on the other hand, requires a data set with a single row per patient, containing the time to death and the censoring indicator (aids.id).
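The two layouts can be illustrated with a toy data set in base R (a sketch only: the variable names mimic those of the aids data, but the values are invented for illustration):

```r
# Longitudinal ("long") format: one row per visit and patient
aids.toy <- data.frame(
  patient = rep(1:3, times = c(3, 2, 4)),
  obstime = c(0, 2, 6, 0, 2, 0, 2, 6, 12),
  CD4     = c(10.2, 8.1, 6.0, 12.5, 11.0, 9.7, 9.1, 7.4, 5.3),
  Time    = rep(c(11, 6, 13), times = c(3, 2, 4)),  # time to death/censoring
  death   = rep(c(1, 0, 1), times = c(3, 2, 4))     # censoring indicator
)

# Survival ("one row per patient") format: keep the first row of each patient
aids.id.toy <- aids.toy[!duplicated(aids.toy$patient),
                        c("patient", "Time", "death")]
nrow(aids.toy)    # 9 observations
nrow(aids.id.toy) # 3 patients
```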


How to manage longitudinal data with R?

During the time that this blog has been running, several posts about longitudinal data have been published. However, we have not yet talked about how to deal with them in R.

As we have discussed in other posts, longitudinal data gather information from a group of study subjects over time (repeated measures). When the number of measurements is the same for all subjects and these measurements are equidistant in time, the data are said to be balanced; otherwise, they are called unbalanced.


When working with either format in R, the joineR package allows us to adapt our data. If our data are balanced, we can move from one format to the other, according to the analysis we are interested in, simply by:

# generate a balanced data set
simul <- data.frame(id = 1:3, Y.1 = rnorm(3), Y.2 = rnorm(3), age = runif(3, 0, 18))
# move it to an unbalanced (long) format
simul.unbal <- to.unbalanced(simul, id.col = 1, times = c(1, 2), Y.col = 2:3, other.col = 4)
# return the data to a balanced (wide) format
simul.bal <- to.balanced(simul.unbal, id.col = "id", time.col = "time", Y.col = "Y.1", other.col = 4)

Once we have our data in an appropriate format, one of the first descriptive analyses to perform is the empirical longitudinal variogram. The variogram allows us to check whether within-subject observations are correlated. For this we need the data in the unbalanced format, and we can obtain the variogram very easily with joineR's variogram function (although expect it to be a bit slow if you have a large dataset).

As an example, we will load a data set included in the Applied Longitudinal Data Analysis book by Judith D. Singer and John B. Willett and compute the empirical variogram:

# read in data set (tolerance data from the ALDA book)
tolerance <- read.csv("http://www.ats.ucla.edu/stat/r/examples/alda/data/tolerance1_pp.txt")
vgm <- variogram(indv = tolerance$id, time = tolerance$time, Y = tolerance$tolerance)


The package also provides further plotting functions, as well as the joint analysis of longitudinal and survival data using random effects joint models. Certainly a very interesting package for those who deal with this type of data or are interested in starting to work with it.

Try it and tell us your experience.


P-splines for longitudinal data


Ipek Guler is an intern at Biostatech, Advice, Training and Innovation in Biostatistics, S.L. She has a MSc in Statistical Techniques from the University of Santiago de Compostela (Spain) and a BSc in Statistics from Gazi University (Turkey).

Contact Ipek

In previous posts about longitudinal data by Urko Agirre and Hèctor Perpiñán, mention has already been made of the mixed models methodology.

Usually in medical and biological studies, the designed experiments include repeated measures, which are used to investigate changes over a period of time, recorded repeatedly for each subject in the study. Multiple measurements are obtained for each individual at different times. This type of longitudinal data can be analyzed with a mixed effects model (Pinheiro and Bates, 2000), which allows modeling and analysis of between- and within-individual variation. Examples of such data are growth curves measured over time.

In many practical situations, using traditional parametric regression techniques is not appropriate to model such curves. Misleading conclusions may be reached if the time effect is incorrectly specified.

Durban et al. (2005) presented flexible mixed models for fitting subject-specific curves and factor-by-curve interactions for longitudinal data, in which the individual and interaction curves are modeled as penalized splines (P-splines) and model estimation is based on the mixed model representation of P-splines (Currie and Durban, 2002).

This representation is quite interesting because it allows us to use the methodology and the extensive software available for mixed models (e.g., the nlme and lme4 packages in R), and it also comes equipped with an automatic smoothing parameter choice corresponding to maximum likelihood (ML) and/or restricted maximum likelihood (REML) estimation of variance components. With this representation, the smoothing parameter becomes the ratio between the variance of the residuals and the variance of the random effects.

First of all, let's define a linear mixed model for longitudinal data.

In matrix notation, the model is

y = X\beta + Zu + \varepsilon

where:

  • y is the vector of observed responses,

y=\left(\begin{array}{c}y_{1}\\\vdots\\ y_{N}\end{array}\right)=\left(\begin{array}{c}y_{11}\\\vdots\\ y_{1n_{1}}\\ y_{21}\\\vdots\\ y_{Nn_{N}}\end{array}\right)

  • \beta is the fixed effects parameter vector,
  • \varepsilon=\left(\begin{array}{c}\varepsilon_{1}\\\varepsilon_{2}\\\vdots\\ \varepsilon_{N}\end{array}\right) is the residual vector,
  • u=\left(\begin{array}{c}u_{1}\\u_{2}\\\vdots\\ u_{N}\end{array}\right) denotes the unknown individual (random) effects,
  • Z=\left[\begin{array}{ccc}Z_{1}&\ldots&0\\\vdots&\ddots&\vdots\\ 0&\ldots&Z_{N}\end{array}\right] is a known design matrix linking u to y,
  • X=\left[\begin{array}{c}X_{1}\\X_{2}\\\vdots\\ X_{N}\end{array}\right] is the fixed effects design matrix.

The u_{i}, i=1,\ldots,N, are assumed to be N(0,G), independently of each other and of the \varepsilon_{i}; each \varepsilon_{i} is distributed as N(0,R_{i}), where R_{i} and G are positive definite covariance matrices.

Consider a flexible regression model,

y_{i}=f(x_{i})+\varepsilon_{i}, \qquad \varepsilon_{i}\sim N(0,\sigma^2)

where y_{i} is the response variable of observation i=1,\ldots,N and f(\cdot) is a smooth function of the covariate x. We represent this function as a linear combination of d known basis functions B_{j}.

We reformulate it in matrix form as

y=B\theta+\varepsilon, \qquad \varepsilon\sim N(0,\sigma^2I)

where B is the basis matrix with elements B_{j}(x_{i}).
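As an illustration, a B-spline basis matrix B can be built with the splines package that ships with R (the choice of d = 10 basis functions and the toy coefficient vector are assumptions of this sketch):

```r
library(splines)

x <- seq(0, 1, length.out = 50)
# d = 10 basis functions B_j evaluated at the 50 observation points
B <- bs(x, df = 10, intercept = TRUE)
dim(B)  # 50 x 10

# f(x) is then approximated by B %*% theta for a coefficient vector theta
theta <- rnorm(10)
f.hat <- B %*% theta
```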

Depending on the basis used for the P-splines, X and Z take the following forms:

  • Truncated polynomials

X=\left[1, x_{i}, \ldots, x_{i}^{p}\right], \qquad Z=\left[(x_{i}-\kappa_{k})^{p}_{+}\right]
1 \leqslant i \leqslant n, \quad 1 \leqslant k \leqslant K

where \kappa_{1},\ldots,\kappa_{K} are the knots.

  • B-splines

X=BU_{n}, \qquad Z=BU_{s}\Sigma_{s}^{-1/2}

where U=[U_{n}:U_{s}] is the matrix that contains the eigenvectors of the singular value decomposition of the penalty matrix P=D^{\prime}D, and \Sigma is a diagonal matrix containing the eigenvalues, q of which are null (Lee, 2010); U_{n} collects the eigenvectors associated with the null eigenvalues and \Sigma_{s} the non-null eigenvalues.

Therefore the model becomes

y=X\beta+Zu+\varepsilon, \qquad u\sim N(0,\sigma^2_{u}I_{c-2}), \quad \varepsilon\sim N(0,\sigma^2I)

where c is the number of columns of the basis B, and the smoothing parameter becomes \lambda=\frac{\sigma^2}{\sigma^2_{u}} (Durban et al., 2005).
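The representation above can be sketched in base R with a truncated-line basis (a toy example: the simulated curve, the knot grid and the fixed value of \lambda are all assumptions of this sketch, not part of the original post):

```r
set.seed(1)
n <- 100
x <- seq(0, 1, length.out = n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)

# Truncated-line basis (p = 1): X = [1, x], Z = [(x - kappa_k)_+]
knots <- seq(0.1, 0.9, by = 0.1)
X <- cbind(1, x)
Z <- outer(x, knots, function(x, k) pmax(x - k, 0))

# For a fixed lambda = sigma^2 / sigma^2_u, the penalized fit solves a
# ridge-type system in which only the random-effect coefficients u are shrunk
C <- cbind(X, Z)
lambda <- 0.01
D <- diag(c(0, 0, rep(1, length(knots))))  # no penalty on the fixed part
theta <- solve(t(C) %*% C + lambda * D, t(C) %*% y)
fit <- C %*% theta

# Mean squared distance of the fit from the true underlying curve
mean((fit - sin(2 * pi * x))^2)
```

In a full mixed model fit, lambda would not be fixed by hand but estimated through ML/REML as the ratio of the two variance components.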

With this mixed model representation of P-splines, we obtain more flexible mixed models that allow a simple implementation of otherwise complicated models. In future posts, we will talk more about these flexible models with individual curves and factor-by-curve interactions, also described in Durban et al. (2005).

The complex structure of the longitudinal models

Two weeks ago, we started to talk in this blog about longitudinal data with the post by Urko Agirre. This type of data calls for models with a complex structure, the so-called longitudinal models.

Longitudinal studies have two important characteristics:

  1. They are multivariate because for each studied individual many temporal measurements from the response variable (and covariates) are collected.

  2. They are multilevel as the variables measured are nested within the subjects under study, therefore resulting in layers.

These characteristics allow us to make inferences about the general trend of the population, as well as about the specific differences of subjects that deviate from the overall average behavior.

At the beginning of the 20th century this type of data started to be modelled. Different proposals appeared, such as ANOVA models (Fisher, 1918), MANOVA models (the multivariate generalisation of ANOVA) and growth curves (Grizzle and Allen, 1969). All these proposals improved some aspects but left others unresolved. On the one hand, ANOVA models are univariate while our data are multivariate. On the other hand, MANOVA models are multivariate but assume independence between intra-subject observations (observations from the same individual are, in general, not independent). Finally, growth curves do account for the dependence between intra-subject observations but are too restrictive on the design matrix.

It was not until the early 1980s that a proposal appeared that addressed all the aspects of these complex data. Laird and Ware proposed the application of linear mixed models (LMM) in the paper “Random-Effects Models for Longitudinal Data”.

The basic structure of these LMM for each patient i is:

y_i = X'_i\beta + Z'_ib_{i} + W_i(t_i) + \epsilon_i


  • y_i=(y_{i1},\ldots,y_{in_i}) is the vector of measurements of the response variable for the ith of a total of m subjects, taken at times t_i=(t_{i1},\ldots,t_{in_i}); n_i is the number of repeated measures of the ith patient.

  • X'_i\beta represents the deterministic part of the model, where X'_i is the design matrix of covariates associated with the ith individual and \beta its parameter vector.

  • Z'_ib_i are the random effects, responsible for capturing the variability between individuals.

  • W_i(t_i) accounts for intra-individual variability, i.e., the correlation between observations of the same subject.

  • Finally, \epsilon_i reflects the residual variability that is not due to any systematic source we can determine.
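A model with the first three components above can be fitted with the nlme package, which ships with R (a minimal sketch on simulated data; the subject count, time grid and variance values are assumptions of this example):

```r
library(nlme)

set.seed(123)
m <- 20   # subjects
n_i <- 5  # measurements per subject
dat <- data.frame(id = rep(1:m, each = n_i),
                  time = rep(0:(n_i - 1), times = m))
# Subject-specific deviations b_i around the population intercept and slope
b0 <- rnorm(m, sd = 1)
b1 <- rnorm(m, sd = 0.2)
dat$y <- (2 + b0[dat$id]) + (0.5 + b1[dat$id]) * dat$time +
  rnorm(nrow(dat), sd = 0.5)

# X'_i beta: population intercept and slope; Z'_i b_i: random intercept
# and slope per subject
fit <- lme(y ~ time, random = ~ time | id, data = dat)
fixef(fit)  # population-level estimates, close to 2 and 0.5
```

Serial intra-individual correlation, the W_i(t_i) term, could additionally be specified through the correlation argument of lme.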

To specify each of these elements, it is essential to first produce the descriptive graphics presented in this previous post.

We have learned more about repeated measures, and in upcoming posts we will talk more about them, because this is only the beginning. To be continued!!!

From Descriptive to Repeated Measures Data….one small step for studies, one giant leap for (bio)statistics

Traditional epidemiological descriptive studies, also called cross-sectional studies, are characterized by reporting population health, describing the existing distribution of the collected exposure factors and variables without relating them to other hypotheses. In other words, they try to answer three basic “W” questions: who, where and when. The most important uses of this kind of research include health planning and hypothesis generation. Nonetheless, the most important pitfall is that researchers might draw causal inferences from this type of study: temporal associations between the exposures and the outcomes of interest might be unclear. Thus, when a researcher wants to verify a causal effect between two variables, a more appropriate design is highly recommended, such as a study with two or more observations per subject collected over the established research period. The latter design corresponds to a repeated measurement data structure and, more specifically, to longitudinal data analysis (a common form of repeated measures analysis in which measurements are recorded on individual subjects over time).

As mentioned in the previous paragraph, the main difference between the two research designs, cross-sectional and longitudinal, is that each experimental unit participating in the first is observed only once, so for each exposure factor one has only one value per subject; in other words, each row in the dataset is an observation. In longitudinal data, however, each subject is observed more than once.

It is also worth pointing out the increase in the complexity of the statistical approaches when moving from descriptive analyses to repeated measures studies. In the first setting the statistical methods in use are the simplest ones: mean and percentage comparisons by means of classical tests, regression analysis, etc. In repeated measures data sets, however, and specifically in longitudinal data analysis, special statistical techniques are required for valid analysis and inference. Thus, researchers should take into account three important points to build a proper statistical model, in this order: (1) the trend of the temporal component; (2) the variance-covariance structure; (3) the mean structure. More precisely, the overall temporal trend should be assessed first of all: temporal trends can follow a linear, quadratic, cubic or even a fourth-degree polynomial function. Besides, as observations on the same subject are likely to be correlated, repeated measures analyses must account for this correlation (the within- and between-subject effects must be controlled). Among the possible covariance structures, compound symmetry, unstructured and first-order autoregressive are the most commonly used. As for the mean structure, the potential exposure factors that could be related to the dependent variable should be included in the model.
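In R, the covariance structures mentioned above can be specified, for example, through the correlation argument of gls in the nlme package (a sketch on simulated data; the subject count, visit grid and AR coefficient are assumptions of this example):

```r
library(nlme)

set.seed(42)
m <- 30      # subjects
visits <- 4  # equally spaced visits per subject
dat <- data.frame(id = factor(rep(1:m, each = visits)),
                  time = rep(1:visits, times = m))
# Serially correlated within-subject errors (AR(1)-like)
err <- as.vector(replicate(m, arima.sim(list(ar = 0.6), n = visits, sd = 0.5)))
dat$y <- 1 + 0.3 * dat$time + err

# First-order autoregressive within-subject correlation
fit.ar1 <- gls(y ~ time, data = dat, correlation = corAR1(form = ~ time | id))
# Compound symmetry (constant within-subject correlation)
fit.cs  <- gls(y ~ time, data = dat, correlation = corCompSymm(form = ~ 1 | id))

# Compare the two candidate structures, e.g. by AIC
AIC(fit.ar1, fit.cs)
```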


Longitudinal studies play a key role, mostly in epidemiology and clinical research. They are used to determine the change in an outcome measurement over time or to evaluate the effectiveness of a new treatment in a clinical trial, among other settings. Due to the complexity of the statistical analyses, longitudinal studies involve a great deal of effort, but they offer several benefits. The most important ones, from my point of view, are the following: (1) the ability to measure change in outcomes and/or exposures at the individual level, so that the researcher has the opportunity to observe individual patterns of change; (2) the temporal order of the exposure factors and the outcomes is measured, and therefore the timing of the outcome onset can be related to the covariates.

Finally, there is no single dedicated statistical package for this kind of analysis: nowadays, most packages include in their recent releases all the procedures needed to perform at least a basic longitudinal analysis. So there is no longer any excuse for not distinguishing a repeated measures/longitudinal analysis from a descriptive one, and for not carrying it out with confidence… Do you agree?