Featured

# p-values explained

David Blanco (UPC) recently prepared this video following the ASA statement, and we wanted to share it with you. We particularly love the very useful examples (who would not want those work colleagues?!) and the BNE Shiny application.

We also highly recommend this RSS video on the development and impact of the statement, including some very interesting discussions.


A set of online video tutorials recorded during the cycle of conferences “Bioestadística para Vivir” can be found here. These talks are aimed at the general public and seek to show the impact of Biostatistics on everyday life, especially in the fields of health and the environment.

Biostatnet is one of the collaborating members of this project, “Bioestadística para Vivir y Cómo Vivir de la Estadística”, which is promoted by the University of Santiago de Compostela, through its Unit of Biostatistics (GRIDECMB), and the Galician Institute of Statistics. This knowledge exchange initiative is funded by the Spanish Science & Technology Foundation (FECYT).

Other activities are being developed under this project: the exhibition Exploristica – Adventures in Statistics, an itinerant exhibition teaching Statistics to secondary school students; cycles of conferences; problem-solving workshops; etc. You can find out more here.


# P-splines for longitudinal data

Ipek Guler is an intern at Biostatech, Advice, Training and Innovation in Biostatistics, S.L. She has a MSc in Statistical Techniques from the University of Santiago de Compostela (Spain) and a BSc in Statistics from Gazi University (Turkey).

Contact Ipek

In previous posts about longitudinal data by Urko Agirre and Hèctor Perpiñán, mention has already been made of the mixed models methodology.

Usually in medical and biological studies, the designed experiments include repeated measures, which are used to investigate changes over a period of time, measured repeatedly for each subject in the study. Multiple measurements are obtained for each individual at different times. This type of longitudinal data can be analyzed with a mixed effects model (Pinheiro and Bates, 2000), which allows modeling and analysis of between- and within-individual variation. An example of such data is growth curves measured over time.

In many practical situations, using traditional parametric regression techniques is not appropriate to model such curves. Misleading conclusions may be reached if the time effect is incorrectly specified.

Durban et al. (2005) presented flexible mixed models for fitting subject-specific curves and factor-by-curve interactions for longitudinal data, in which the individual and interaction curves are modeled as penalized splines (P-splines) and model estimation is based on the mixed model representation of P-splines (Currie and Durban, 2002).

This representation is quite interesting because it allows us to use the methodology and software available for mixed models (e.g., the nlme and lme4 packages in R), and it also comes equipped with an automatic smoothing parameter choice that corresponds to maximum likelihood (ML) and/or restricted maximum likelihood (REML) estimation of variance components. With this representation, the smoothing parameter becomes the ratio between the variance of the residuals and the variance of the random effects.

First of all, let's define a linear mixed model for longitudinal data.

In matrix notation:

$y=X\beta+Zu+\varepsilon$

• $y$ is the vector of observed responses, stacked subject by subject:

$y=\left(\begin{array}{c}y_{1}\\\vdots\\ y_{N}\end{array}\right)=\left(\begin{array}{c}y_{11}\\\vdots\\ y_{1n_{1}}\\ y_{21}\\\vdots\\ y_{Nn_{N}}\end{array}\right)$

• $X=\left[\begin{array}{c}X_{1}\\X_{2}\\\vdots\\ X_{N}\end{array}\right]$ is the fixed-effects design matrix,
• $\beta$ is the vector of fixed-effects parameters,
• $Z=\left[\begin{array}{ccc}Z_{1}&\ldots&0\\\vdots&\ddots&\vdots\\ 0&\ldots&Z_{N}\end{array}\right]$ is a known design matrix linking $u$ to $y$,
• $u=\left[\begin{array}{c}u_{1}\\u_{2}\\\vdots\\ u_{N}\end{array}\right]$ denotes the unknown individual (random) effects,
• $\varepsilon=\left(\begin{array}{c}\varepsilon_{1}\\\varepsilon_{2}\\\vdots\\ \varepsilon_{N}\end{array}\right)$ is the residual vector.

The $u_{i}$, $i=1,\ldots,N$, are assumed to be $N(0,G)$, independent of each other and of the $\varepsilon_{i}$; each $\varepsilon_{i}$ is distributed as $N(0,R_{i})$, where $R_{i}$ and $G$ are positive definite covariance matrices.
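To make the notation concrete, here is a minimal simulation sketch of a random-intercept special case of the model above, where each $Z_{i}$ is a column of ones. The Python/NumPy code and all numbers are our own illustrative assumptions; the post itself points to R's nlme and lme4 for actual model fitting.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical sizes: N subjects with n_i = 4 repeated measures each
N, n_i = 10, 4
n = N * n_i

# Fixed part: intercept and a time trend; beta chosen arbitrarily
time = np.tile(np.arange(n_i, dtype=float), N)
X = np.column_stack([np.ones(n), time])
beta = np.array([2.0, 0.5])

# Z is block-diagonal with Z_i a column of ones, linking u to y
subject = np.repeat(np.arange(N), n_i)
Z = np.zeros((n, N))
Z[np.arange(n), subject] = 1.0

# u_i ~ N(0, G) with G = sigma_u^2, and eps ~ N(0, sigma^2 I)
sigma_u, sigma = 1.0, 0.3
u = rng.normal(0.0, sigma_u, size=N)
eps = rng.normal(0.0, sigma, size=n)

y = X @ beta + Z @ u + eps
print(y.shape)  # (40,)
```

Each row of $Z$ selects the random intercept of the subject the observation belongs to, which is exactly the block-diagonal structure written above.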

Consider a flexible regression model,

$y_{i}=f(x_{i})+\varepsilon_{i}$

$\varepsilon_{i}\sim N(0,\sigma^2)$

where $y_{i}$ is the response for observation $i=1,\ldots,N$ and $f(\cdot)$ is a smooth function of the covariate $x$. We represent this function as a linear combination of $d$ known basis functions $B_{j}$, i.e. $f(x)=\sum_{j=1}^{d}\theta_{j}B_{j}(x)$, so the model can be rewritten as

$y=B\theta+\varepsilon$

$\varepsilon\sim N(0,\sigma^2I)$

Depending on the basis used for the P-splines, $X$ and $Z$ take the following forms:

• Truncated polynomials

$X=\left[1,x,\ldots,x^{p}\right]$
$Z=\left[(x_{i}-\kappa_{k})^{p}_{+}\right], \quad 1 \leqslant i \leqslant n, \quad 1 \leqslant k \leqslant K,$

where $\kappa_{1},\ldots,\kappa_{K}$ are the knots.

• B-splines

$X=[1:x]$

$Z=BU\Sigma^{-1/2}$

where $U$ is the matrix containing the eigenvectors of the singular value decomposition of the penalty matrix $P=D'D$ and $\Sigma$ is a diagonal matrix containing the eigenvalues, $q$ of which are null (Lee, 2010).

Therefore, the model becomes:

$y=X\beta+Zu+\varepsilon$

$u\sim N(0,\sigma^2_{u}I_{c-2})$

and $\varepsilon\sim N(0,\sigma^2I)$

where $c$ is the number of columns of the basis $B$, and the smoothing parameter becomes $\lambda=\frac{\sigma^2}{\sigma^2_{u}}$ (Durban et al., 2005).
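As a concrete illustration of the simpler truncated-polynomial basis above, here is a minimal sketch of how $X$ and $Z$ can be constructed. This is our own Python/NumPy illustration; the function name and data are hypothetical.

```python
import numpy as np

def truncated_poly_basis(x, knots, p=1):
    """Truncated-polynomial basis in mixed-model form: X holds the
    polynomial part [1, x, ..., x^p] and Z holds the truncated
    terms (x - kappa_k)_+^p, one column per knot."""
    x = np.asarray(x, dtype=float)
    X = np.column_stack([x ** j for j in range(p + 1)])
    Z = np.maximum(x[:, None] - np.asarray(knots)[None, :], 0.0) ** p
    return X, Z

# Hypothetical example: 5 points, 2 interior knots, linear splines (p = 1)
x = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
X, Z = truncated_poly_basis(x, knots=[1 / 3, 2 / 3], p=1)
print(X.shape, Z.shape)  # (5, 2) (5, 2)
```

The columns of $Z$ produced this way would enter a mixed-model fit as random-effect regressors, with the REML estimates of $\sigma^2$ and $\sigma^2_{u}$ delivering the smoothing parameter automatically.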

With this mixed model representation of P-splines, we obtain more flexible mixed models which allow a simple implementation of otherwise complicated models. In future posts we will talk more about these flexible models with individual curves and factor-by-curve interactions, also described in Durban et al. (2005).


# A computational tool for applying Bayesian methods in simple situations

Luis Carlos Silva Ayçaguer. Senior researcher at the Escuela Nacional de Salud Pública in La Habana, Cuba; member of the development team of Epidat. Degree in Mathematics from Universidad de La Habana (1976), PhD from Universidad Carolina (Prague, 1982), Doctor of Science from Universidad de Ciencias Médicas (La Habana, 1999), Titular Academician of the República de Cuba.

Email: lcsilva@infomed.sld.cu

Soly Santiago Pérez. Technical statistician at Dirección Xeral de Innovación e Xestión da Saúde Pública (General Directorate of Public Health, Spain) from 1996 to present; member of the development team of Epidat. Degree in Mathematics from Universidad de Santiago de Compostela (Spain, 1994). Specialist in Statistics and Operational Research.

In this post, we present a user-friendly tool for applying Bayesian methods in simple situations. This tool is part of a free software package, Epidat, which has been developed by the Dirección Xeral de Innovación e Xestión da Saúde Pública (Xunta de Galicia, Spain) since the early 1990s. The general purpose of Epidat is to provide an alternative to other statistical packages for data analysis; more specifically, it brings together a broad range of statistical and epidemiological techniques under a common interface. At present, the fourth version of Epidat, developed in Java, is freely available from the web page http://dxsp.sergas.es; to download the program, registration is required.

As stated above, one of the methods or “modules” included in Epidat 4 is Bayesian analysis, a tool for the application of Bayesian techniques to basic problems, such as the estimation and comparison of means and proportions. The module provides a simple approach to Bayesian methods, not based on hierarchical models, which go beyond the scope of Epidat.

The module of Bayesian analysis is organized into several sub-modules with the following scheme:

• Bayes’ theorem
• Odds ratio
• Proportion
  • One population
    • Estimation of a proportion
    • Assessment of hypotheses
  • Two populations
    • Estimation of effects
    • Assessment of hypotheses
• Mean
  • One population
    • Estimation of a mean
    • Assessment of hypotheses
  • Two populations
    • Estimation of a difference
    • Assessment of hypotheses
• Bayesian approach to conventional tests

The first option of the module can be used to apply Bayes’ theorem. The following three options (Odds ratio, Proportion and Mean) are designed to solve basic inferential problems under Bayesian logic: estimation of odds ratios, proportions and means, as well as differences or ratios of the latter two parameters. The point estimate is accompanied by the corresponding credibility interval. The techniques available in these options also include methods related to hypothesis testing. Finally, the last sub-module allows the evaluation of conventional tests from a Bayesian perspective.
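As a rough sketch of what such an estimation option computes, consider the simplest case: estimating a proportion with a Beta prior, where the credibility interval is obtained by Monte Carlo simulation. This is our own illustrative Python code with hypothetical data, not Epidat's implementation.

```python
import random

def beta_credible_interval(y, n, a=1.0, b=1.0, level=0.95,
                           draws=100_000, seed=1):
    """Posterior for a proportion under a Beta(a, b) prior: with y
    successes in n trials the posterior is Beta(a + y, b + n - y).
    The credibility interval is read off Monte Carlo draws."""
    rng = random.Random(seed)
    post_a, post_b = a + y, b + n - y
    sample = sorted(rng.betavariate(post_a, post_b) for _ in range(draws))
    lo = sample[int((1 - level) / 2 * draws)]
    hi = sample[int((1 + level) / 2 * draws) - 1]
    mean = post_a / (post_a + post_b)
    return mean, lo, hi

# Hypothetical data: 27 cures out of 40 patients, uniform Beta(1, 1) prior
mean, lo, hi = beta_credible_interval(27, 40)
print(round(mean, 3))  # 0.667
```

The user of Epidat never sees this machinery: as noted below, the simulation runs behind a graphical interface where only the prior and the data need to be specified.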

Some specific options of Bayesian analysis require the use of simulation techniques. Nevertheless, the user does not need to understand the simulation process to use the program and interpret the results correctly. In addition, the module has a user-friendly interface that allows the user to plot the a priori distribution and choose the values of its parameters (see figure above). The output of Bayesian analysis includes both numerical and graphical results, and the graphics can be edited and modified by the user. In addition, the contents of the output window can be saved as a file in several formats: *.epi, *.pdf, *.odf, or *.rtf.

Finally, like all modules of Epidat, Bayesian analysis has a useful help feature, which can be accessed from the help menu in PDF format. This facility has been developed with a didactic and critical approach, including the statistical basics of the methods, bibliographic references, and examples of the different options.

# Another important R: Relative Risk

María Álvarez Hernández, BSc in Mathematics (University of Salamanca), is a PhD student in Statistics and Operations Research at the University of Granada, where she works with Professor Martín Andrés. Her line of research is framed within the statistical analysis of categorical data from contingency tables. Contact María

One of the common objectives of Health Sciences is to compare the proportions of individuals with a feature of interest in two different populations, for which purpose it is usual to take two independent samples. This is the case when comparing the proportion of cures under two different treatments, or the proportion of patients in groups with and without a particular risk factor. In such situations, one option is the difference between the two proportions, but in the field of Medicine the parameter of interest is usually their ratio, the relative risk (R). Examples of this are clinical trials evaluating the effectiveness of a new vaccine, studies comparing two binary diagnostic methods, studies comparing two different treatments, etc.

From an exact point of view, obtaining a confidence interval for R is computationally very intensive: it requires specific computer programmes and is not feasible for moderately large sample sizes (Reiczigel et al., 2008). Hence researchers have devoted great attention to obtaining approximate confidence intervals and, although many different procedures have been proposed, these have not always been compared. Nowadays, there is general consensus that the best procedure is the score method proposed by Koopman (1984) and by Miettinen and Nurminen (1985). Alternatively, other simpler methods have been proposed which work more or less well (Farrington and Manning, 1990; Dann and Koch, 2005; Zou and Donner, 2008).

One piece of research in which I am involved is improving these methods and suggesting new ones that will allow us to achieve a result closer to the exact one, without losing rigor in the process (Martín and Álvarez, 2012). But although the improvement may be at a theoretical level, what happens on the computational side?

From a practical point of view, obtaining confidence intervals for the relative risk through statistical packages such as SPSS 20, Stata 12 or StatXact 10 also focuses on the asymptotic case, although in some of them the researcher can actually obtain the exact confidence interval (in some situations incurring a long computational time). In general, the methods used are based on the ideas of Miettinen and Nurminen (1985), where a standard normal distribution is assumed; Katz et al. (1978), who applied the logarithmic transformation; and Koopman (1984), with the well-regarded score method. Sometimes, as in the case of the StatXact software, the Berger and Boos correction can be applied because it reduces conservatism (resulting in shorter confidence intervals).
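For instance, the logarithmic-transformation interval of Katz et al. (1978) is simple enough to sketch directly. This is our own Python illustration with hypothetical data, not the code of any of the packages mentioned.

```python
import math

def katz_rr_ci(a, n1, c, n2):
    """Approximate 95% CI for the relative risk via the log
    transformation (Katz et al., 1978): log(RR) is roughly normal
    with standard error sqrt(1/a - 1/n1 + 1/c - 1/n2)."""
    rr = (a / n1) / (c / n2)
    se = math.sqrt(1 / a - 1 / n1 + 1 / c - 1 / n2)
    z = 1.96  # standard normal quantile for a 95% interval
    return rr, rr * math.exp(-z * se), rr * math.exp(z * se)

# Hypothetical 2x2 table: 30/100 events in one group vs 20/100 in the other
rr, lo, hi = katz_rr_ci(30, 100, 20, 100)
print(round(rr, 2))  # 1.5
```

The score method is preferred precisely because such log-transformation intervals can behave poorly with small counts; this sketch only shows why the asymptotic approach is computationally trivial compared with the exact one.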

The aim must be to obtain not only the theoretically best methods but also those that are more feasible in explicit calculation and involve shorter computational times.

Therefore, although the theory evolves, the routines programmed in statistical packages to make inferences, for example about a measure of association like the relative risk, have not kept pace with other techniques, even though this is a priority for the health sector.

In short, we should not be content with the implemented procedures, and we should spare no effort on research that allows us to improve them quickly and easily.

# Invitation to the XIV Spanish Biometric Conference 2013

Elvira Delgado Márquez, MSc in Applied Statistics, BSc in Computer Engineering and BSc in Statistics (University of Granada) is a PhD student at the University of Castilla-La Mancha where she works with Professor López Fidalgo and Dr. Amo Salas. Her area of expertise is Optimum Experimental Designs.

The term “Biometry” has been used to refer to the field of development of statistical and mathematical methods applicable to data analysis problems in the biological sciences. Statistical methods for the analysis of data from agricultural field experiments to compare the yields of different varieties of wheat, for the analysis of data from human clinical trials evaluating the relative effectiveness of competing therapies for disease, or for the analysis of data from environmental studies on the effects of air or water pollution on the appearance of human disease in a region or country are all examples of problems that would fall under the umbrella of “Biometrics” as the term has been historically used.

Recently, the term “Biometrics” has also been used to refer to the emerging field of technology devoted to the identification of individuals using biological traits, such as those based on retinal or iris scanning, fingerprints, or face recognition. Neither the journal “Biometrics” nor the International Biometric Society are engaged in research, marketing, or reporting related to this technology. Likewise, the editors and staff of the journal are not knowledgeable in this area.

On behalf of the Spanish Biometric Society, the area of Statistics and Operations Research at the University of Castilla-La Mancha welcomes the celebration of the XIV Spanish Biometric Conference 2013, which will be held in Ciudad Real (Spain) from the 22nd to the 24th of May, 2013.

Full information can be found on the Conference’s website, as well as by contacting the following e-mail address: biometria2013@gmail.com

We invite scholars willing to promote the development and application of mathematical and statistical methods in the areas of Biology, Medicine, Psychology, Pharmacology, Agriculture, Bioinformatics and other areas related to the life sciences to come to Ciudad Real and participate in the presentation of the latest results in these areas.

Furthermore, the Biometrical Journal (edited in cooperation with the German and the Austro-Swiss Region of the International Biometric Society), indexed in Journal Citation Reports (JCR), will publish a special issue with a selection of the papers presented at the conference.

We remain at your disposal and we look forward to welcoming you in Ciudad Real very soon.

Elvira Delgado on behalf of the organizing committee.