Featured

# P-splines for longitudinal data

Ipek Guler is an intern at Biostatech, Advice, Training and Innovation in Biostatistics, S.L. She has a MSc in Statistical Techniques from the University of Santiago de Compostela (Spain) and a BSc in Statistics from Gazi University (Turkey).

Contact Ipek

In previous posts about longitudinal data by Urko Agirre and Hèctor Perpiñán, mentions have already been made to the Mixed Models methodology .

Usually in medical and biological studies, the designed experiments include repeated measures which are used to investigate changes during a period of time repeatedly for each subject in the study. Multiple measurements are obtained for each individual at different times. These type of longitudinal data can be analyzed with a mixed effects model (Pinheiro and Bates, 2000) which allows modeling and analysis of between and within individual variation. An example of such data can be growth curves measured over time.

In many practical situations, using traditional parametric regression techniques is not appropriate to model such curves. Misleading conclusions may be reached if the time effect is incorrectly specified.

Durban et al, (2005) presented flexible mixed models for fitting subject specific curves and factor by curve interactions for longitudinal data in which the individual and interaction curves are modeled as penalized-splines (P-splines) and model’s estimation is based on the mixed model representation of P-splines (Currie and Durban, 2002).

This representation is quite interesting because it allows us to use the methodology and the several software available for mixed models (e.g., nlme and lme4 packages in R), and it also comes equipped with an automatic smoothing parameter choice that corresponds to maximum likelihood (ML) and/or restricted maximum likelihood (REML) estimation of variance components. With this representation, the smoothing parameter becomes the ratio between the variance of residuals and the variance of the random effects.

First of all, let`s define a linear mixed model for a longitudinal data.

As a matrix notation;

$y=X\beta+Zu+\varepsilon$

• $y$ is the vector of the observed responses ,

$\left(\begin{array}{cc}y_{1}\\\vdots\\ y_{n}\end{array}\right)=\left(\begin{array}{cc}y_{11}\\y_{12}\\\vdots\\ y_{1n_{1}}\\ y_{2_{1}}\\\cdot\\\cdot\\\cdot\end{array}\right)$

• $\beta$ is the fixed effects parameters vector,
•  $\varepsilon=\left(\begin{array}{cc}\varepsilon_{1}\\\varepsilon_{2}\\\vdots\\ \varepsilon_{N}\end{array}\right)$  is the residual vector,
•  $u=\left[\begin{array}{cc}u_{1}\\u_{2}\\\vdots\\ u_{N}\end{array}\right]$ denotes unknown individual effects.
• $Z=\left[\begin{array}{ccc}Z_{1}&\ldots&0\\\vdots&\ldots&\vdots\\ 0&\ldots&Z_{N}\end{array}\right]$  a known designed matrix linking $u$ to $y$,
• $X=\left[\begin{array}{cc}X_{1}\\X_{2}\\\vdots\\ X_{N}\end{array}\right]$

$u_{i}$, $i=1,\ldots,N$ assumed to be $N(0,G)$, independently of each other and of the $\varepsilon_{i}.\varepsilon_{i}$ is distributed as $N(0,R_{i})$ where $R_{i}$ and $G_{i}$  are positive definitive covariance matrices.

Consider a flexible regression model,

$y_{i}=f(x_{i})+\varepsilon_{i}$

$\varepsilon_{i}\sim N(0,\sigma^2)$

Where $y_{i}$ is the response variable of the observations $i=1 ,\ldots, N$ and $f(\cdot)$ is the smooth function of covariate $x$. We represent this function with a linear combination of $d$ known basis function $B_{j}.$

And we reformulate it as;

$y=B\theta+\varepsilon$

$\varepsilon\sim N(0,\sigma^2I)$

Depending on the basis that is used for the P-splines, $X$ and $Z$ have the following forms;

• Truncated polynomials

$X=\left[1,x,\ldots,x^p\right]$
$Z=\left[(x_{i}-\kappa _{k})^p_+\right]$
$1 \leqslant i \leqslant n$
$1\leqslant k \leqslant \kappa$

• B-splines

$X=[1:x]$

$Z=BU\Sigma^{-1/2}$

Where $U$  is the matrix that contains the eigenvectors of the singular value decomposition of the penalty matrix  $P=D\prime D$ and $\Sigma$  is a diagonal matrix containing the eigenvalues with $q$ null eigen values (Lee, 2010).

Therefore the model becomes;

$y=X\beta+Zu+\varepsilon$

$u\sim N(0,\sigma^2_{u}I_{c-2})$

and $\varepsilon\sim N(0,\sigma^2I)$

Where $c$ is the number of columns of basis $B$, and the smoothing parameter becomes; $=\frac{\sigma^2}{\sigma^2_{u}}$ (Durban et al, 2005).

With this mixed model representation of P-splines, we can obtain more flexible mixed models which allows us to have a simple implementation of otherwise complicated models. In future posts, we can talk more about these flexible models with individual curves and factor by curve interactions also described in Durban et al (2005).