Featured

# …a scientific crowd

While researching scale-free networks, I came across this book, which happens to include the very interesting article The structure of scientific collaboration networks and will serve as a follow-up to my previous post on social networks here.

Collaborative efforts lie at the foundation of the daily work of biostatisticians. As such, the analysis of these relationships (or of the lack of interaction, in some cases) strikes me as fascinating.

The article itself deals with the wider community of scientists, and connections are understood in terms of paper co-authorships. The study provides evidence of a strong small-world structure in the scientific community. However short the distance between pairs of scientists may be, I wonder how hard it is to actually cover that path, i.e., are we really willing to interact with colleagues outside our environment? Is the fear of stepping out of our comfort zone stopping us from pursuing new biostatistical challenges? Interestingly, one of Newman's findings amongst researchers in the areas of physics, computer science, biology and medicine is that “two scientists are much more likely to have collaborated if they have a third common collaborator than are two scientists chosen at random from the community.”

Interaction patterns analyzed through social network diagrams like the one shown in Fig. 1 can give us a hint of these patterns of collaboration, but they can also be a means towards understanding the spread of information and research in the area (ironically, in a fashion similar to the spread of diseases, as explained here).

Fig.1. Biostatistics sociogram (illustration purposes only; R code adapted from here and here)

In my previous post on the topic, I focused on the great LinkedIn InMaps. This time I will be looking at Twitter, as an example of the huge amount of information and the great opportunities for analysis that the platform provides. R, with its twitteR package, makes it even easier… After adapting the code from a really useful post (see here), I obtained data on Twitter users and the number of times they used certain hashtags (see plots in Fig. 2).
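As a rough illustration of the kind of query involved (this is not the exact code behind Fig. 2, which was adapted from the post linked above, and it assumes you have already set up Twitter API authentication), a minimal sketch with the twitteR package could look like this:

library(twitteR)

# Authentication with your own Twitter API credentials is required beforehand;
# the details depend on the twitteR/ROAuth version and are omitted here.

# Retrieve recent tweets containing a hashtag (the hashtag and n are arbitrary)
tweets <- searchTwitter("#biostatistics", n = 500)

# Convert to a data frame and count tweets per user,
# which can then be plotted as in Fig. 2
df <- twListToDF(tweets)
counts <- sort(table(df$screenName), decreasing = TRUE)
barplot(counts, las = 2, cex.names = 0.6,
        main = "#biostatistics tweets per user")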

Fig.2. Frequency counts for #bio (top left), #statistics (top right), #biostatistics (bottom left), and #epidemiology (bottom right). Twitter account accessed on the 17th of May 2013.

Although not an exhaustive analysis, it is interesting to notice the lower figures for #biostatistics (turquoise) and #statistics (pink) compared to #bio (green) and #epidemiology (blue), for example (please note the different scales on the y axis of the four plots). It makes me wonder whether activity in the field is not our strongest point and whether it could be a fantastic way to promote our profession. I am certainly convinced of the great benefits a higher presence in the media would have, particularly in making the profession more attractive to the younger generations.

That was just a little peek at even more exciting analyses to come in future posts; meanwhile, see you on the media!

Do you make any use of social networks in your work? Any interesting findings? Can't wait to hear them all!

Featured

# A computational tool for applying Bayesian methods in simple situations

Luis Carlos Silva Ayçaguer. Senior researcher at the Escuela Nacional de Salud Pública in La Habana, Cuba; member of the development team of Epidat. Degree in Mathematics from Universidad de La Habana (1976), PhD from Universidad Carolina (Prague, 1982), Doctor of Science from Universidad de Ciencias Médicas (La Habana, 1999), Titular Academician of the República de Cuba.

Email: lcsilva@infomed.sld.cu

Soly Santiago Pérez. Technical statistician at Dirección Xeral de Innovación e Xestión da Saúde Pública (General Directorate of Public Health, Spain) from 1996 to present; member of the development team of Epidat. Degree in Mathematics from Universidad de Santiago de Compostela (Spain, 1994). Specialist in Statistics and Operational Research.

In this post we present a user-friendly tool for applying Bayesian methods in simple situations. This tool is part of a free software package, Epidat, that has been developed by the Dirección Xeral de Innovación e Xestión da Saúde Pública (Xunta de Galicia, Spain) since the early 90s. The general purpose of Epidat is to provide an alternative to other statistical packages for data analysis; more specifically, it brings together a broad range of statistical and epidemiological techniques under a common interface. At present, the fourth version of Epidat, developed in Java, is freely available from the web page http://dxsp.sergas.es; registration is required to download the program.

As stated above, one of the methods or “modules” included in Epidat 4 is Bayesian analysis, a tool for the application of Bayesian techniques to basic problems, such as the estimation and comparison of means and proportions. The module provides a simple approach to Bayesian methods; it is not based on hierarchical models, which go beyond the scope of Epidat.

The module of Bayesian analysis is organized into several sub-modules with the following scheme:

• Bayes’ theorem
• Odds ratio
• Proportion
  • One population
    • Estimation of a proportion
    • Assessment of hypotheses
  • Two populations
    • Estimation of effects
    • Assessment of hypotheses
• Mean
  • One population
    • Estimation of a mean
    • Assessment of hypotheses
  • Two populations
    • Estimation of a difference
    • Assessment of hypotheses
• Bayesian approach to conventional tests

The first option of the module can be used to apply Bayes’ theorem. The following three options (Odds ratio, Proportion and Mean) are designed to solve basic inferential problems under Bayesian logic: estimation of odds ratios, proportions and means, as well as differences or ratios of these last two parameters. The point estimate is accompanied by the corresponding credibility interval. The techniques available in these options also include methods related to hypothesis testing. Finally, the last sub-module allows the evaluation of conventional tests from a Bayesian perspective.
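Epidat itself is a point-and-click tool, but as a rough sketch of the kind of computation behind the estimation of a proportion (a conjugate Beta prior updated with binomial data; the prior parameters and data below are made up for illustration), the idea can be reproduced in a few lines of R:

# Estimation of a proportion with a conjugate Beta prior (illustrative values)
a <- 1; b <- 1          # Beta(1, 1) prior, i.e. uniform on (0, 1)
x <- 18; n <- 60        # observed successes out of n trials

a.post <- a + x         # the posterior is Beta(a + x, b + n - x)
b.post <- b + n - x

post.mean <- a.post / (a.post + b.post)              # point estimate
cred.int  <- qbeta(c(0.025, 0.975), a.post, b.post)  # 95% credibility interval

post.mean
cred.int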

Some specific options of Bayesian analysis require the use of simulation techniques. Nevertheless, the user does not need to understand the simulation process to use the program and interpret the results correctly. In addition, the module has a user-friendly interface that allows the user to plot the a priori distribution and choose the values of its parameters (see figure above). The output of Bayesian analysis includes both numerical and graphical results, and the graphics can be edited and modified by the user. In addition, the contents of the output window can be saved as a file in several formats: *.epi, *.pdf, *.odf, or *.rtf.

Finally, like all modules of Epidat, Bayesian analysis has a useful help facility, available from the help menu in PDF format. This facility has been developed with a didactic and critical approach, and includes the statistical basics of the methods, bibliographic references, and examples of the different options.

Featured

# R2wd package: another tool to create reports

The last post published on FreshBiostats, by Hèctor Perpiñán, was about reports combining the use of R and LaTeX. In it, he explained how to transfer tables and figures from R output to LaTeX.

This week I shall continue with another tool for writing reports when we do the statistical analysis in R. It allows us to reduce the time spent drafting reports and to be more accurate by minimising human copying errors. I am going to focus on the R2wd package, used to write MS Word documents from R.

Although most statisticians are used to working daily with LaTeX, which produces high-quality output, in most cases our clients have never heard of it and prefer to work with MS Word documents, which are easier for them to handle.

The R2wd package needs either the statconnDCOM server (via the rcom package) or the RDCOMClient package to communicate with MS Word via the COM interface. Once this interface is installed (on Windows), we can create a document using the functions available in the package: wdGet, wdTitle, wdSection, wdType, wdSetFont, wdTable (the object must be a data frame or an array), wdPlot, wdApplyTheme, etc.

Not only do these functions allow us to insert tables and figures from R output, but they can also be used to inject text elements into the Word report, apply themes and templates, use different types of text (normal, italic, verbatim, …), etc.

The following code shows a small example of how to use this package:

### An easy example to use "R2wd" package ###

library("R2wd") #of course, you must install it before!
#require("rcom")

wdGet() #To open a new Word document

#It is possible that we get an error message with wdGet() function.
#In this case we will need to install:
#installstatconnDCOM()

wdTitle("We introduce here our file title")
wdSection("First section begins here")

wdBody("Inserts text in 'Body' style at the current cursor point in Word.")
wdWrite("The function wdBody is similar to 'wdWrite'",paragraph=TRUE)
wdType("Now we use an italic type and a centering text",italic=TRUE,alignment="center")
wdNormal("To return a Normal type")
wdWrite("We can also insert footnotes")
wdInsertFootnote("insert a footnote") #you can insert footnotes in the
#word document at the cursor position.

wdSection("Now we include another section")
wdSubsection("A toy example")

wdBody("Insert tables of results")

age <- c(23,12,45,34,18,41)
gender <- c(1,1,0,1,0,0)
height <- c(172, 150,169,180,188,160)
data <- data.frame(age,gender,height)

wdTable(data, autoformat=2)  # the autoformat parameter has three possible values
wdItemize()
wdWrite("insert a text in the bullet",paragraph=TRUE)
wdItemize(Template=3)
wdWrite("we finish the description",paragraph=TRUE)
wdBody("To finish this easy example we include a plot in our Word document")

wdPlot(data$age,data$height,col="red",pch=20,height = 10, width =10, pointsize = 25)

wdSave()
wdQuit() #To close the session



As you have seen, besides letting us format the report as we want (although it may seem cumbersome to do from R), R2wd is really useful because with a single function, wdTable(), and one “click” I can copy a large table into my report with no mistakes.

You will find more functions here. Dare to try it! You will save plenty of time!

Featured

# Some tools to do a report (with Latex)

In another post I have already talked about topics not directly associated with Statistics. This time I will write about reports combining the use of R and LaTeX, specifically about how to export tables to LaTeX and how to insert LaTeX typesetting in R figures.

Obviously our technical work is very important, but the final presentation should be taken care of too. Often the problem is moving our results from the statistical software to the text editor, so in this post I will talk about my favourite tools for this task.

When looking for a text editor to write a report, there are many options to choose from (MS Word, LaTeX, knitr/Sweave, Markdown, etc.). I pick LaTeX over MS Word because it makes it easy to include mathematical notation. The connection between knitr/Sweave or Markdown and R is straightforward (both embed R code); I will talk about them in future posts.

R tables to LaTeX

As I mentioned, I first work in R and then move the results to LaTeX. One of the simplest and most effective tools is the function xtable(), found in the package of the same name, xtable.

• xtable(): this function generates, from an R table, the code that we would otherwise write in LaTeX by hand (a short sketch is given below, after the list of packages).

Other important R packages with extra functionality for converting R objects to LaTeX code are:

• Hmisc: you can use its latex() function to create a .tex file.
• texreg: highly configurable and designed to convert model summary outputs.
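A minimal sketch of the typical xtable() workflow (the data frame, caption and file name here are made up for illustration):

library(xtable)

# A toy table of results
results <- data.frame(group = c("A", "B"),
                      n     = c(25, 30),
                      mean  = c(3.2, 4.1))

# Generate the LaTeX code and write it to a .tex file,
# which can then be \input{} into the main document
tab <- xtable(results, caption = "Summary by group", label = "tab:summary")
print(tab, file = "summary_table.tex", include.rownames = FALSE)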

Figures

Many people will know the LaTeX graphics system called TikZ (see the example pages), but what is less well known is that this type of graphics allows a wonderful connection between R and LaTeX: formulas, numbers, etc. are displayed in the format of your choice in LaTeX. To do this we simply have to follow these steps:

1. In R: create the TikZ images using the tikz() function (see this) from the tikzDevice package ( install.packages("tikzDevice", repos="http://R-Forge.R-project.org") ). This function creates a .tex file that could be compiled into a PDF on its own, but the only thing you need to do is include that .tex file in your LaTeX document (a minimal sketch follows this list).
2. In LaTeX: insert the figures into the LaTeX document with \input{normal.tex} and \input{symbol-regression.tex}. The R figure then uses your LaTeX formatting directly. You need to include \usepackage{tikz} in the LaTeX preamble.
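As a rough sketch of step 1 (the file name and plot are arbitrary):

library(tikzDevice)

# Open a TikZ graphics device; the output is a .tex file, not an image
tikz("normal.tex", width = 4, height = 3)

# Any LaTeX in the labels is typeset by LaTeX when the document is compiled
x <- seq(-3, 3, length.out = 200)
plot(x, dnorm(x), type = "l",
     xlab = "$x$", ylab = "$\\phi(x)$",
     main = "Standard normal density")

dev.off()
# In the LaTeX document: \usepackage{tikz} in the preamble, then \input{normal.tex}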

Will you tell us your “tricks” when making reports?

Featured

# Dealing with strings in R

As I mentioned in previous posts, I often have to work with Next Generation Sequencing data. This implies dealing with several variables that are text data, i.e. sequences of characters that might also contain spaces or numbers, e.g. gene names, functional categories or amino acid change annotations. In programming languages, this type of data is called a string.

Finding matches is one of the most common tasks involving strings. In doing so, it is sometimes necessary to format or recode this kind of variable, as well as to search for patterns.

Some R functions I have found quite useful when handling these data include the following:

• colsplit() in the reshape package, which allows you to split up a column based on a regular expression
• grepl() for subsetting based on string values that match a given pattern; here again regular expressions are used to describe the pattern

As you can see from the arguments of these functions, when manipulating strings it is useful to get comfortable with regular expressions. More information on regular expressions for building patterns can be found in this tutorial and in the regex R documentation.
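A quick sketch of both functions on a made-up annotation column (the variable names and patterns are purely illustrative, and colsplit()'s argument names differ slightly between reshape and reshape2):

library(reshape)  # for colsplit()

# Toy annotation data: gene symbol and amino acid change in a single string
ann <- data.frame(variant = c("MAPT p.R406W", "APP p.V717I", "PSEN1 p.M146L"),
                  stringsAsFactors = FALSE)

# Split the column into gene and protein change, using the space as separator
ann <- cbind(ann, colsplit(ann$variant, split = " ", names = c("gene", "aa_change")))

# Keep only the rows whose amino acid change matches a given pattern
ann[grepl("^p\\.[A-Z][0-9]+", ann$aa_change), ]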

Some other useful functions are included in the stringr package. As the title of the package says: “Make it easier to work with strings”:

• str_detect() detects the presence or absence of a pattern in a string; it is based on the grepl function mentioned above
• fixed(): this function looks for matches based on fixed characters, instead of regular expressions

Once again a Hadley Wickham package, along with reshape and plyr, the three of them containing a set of helpful features for handling data frames and lists.
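A minimal illustration of these two stringr functions (the gene names are made up):

library(stringr)

genes <- c("MAPT", "MAP2", "APP", "PSEN1")

# Regular-expression match: gene symbols starting with "MAP"
str_detect(genes, "^MAP")

# Literal match: the pattern is treated as fixed characters, not as a regex
str_detect(genes, fixed("MAP2"))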

This is just a brief summary of some options available in R. Any other tips on string handling?

Featured

# Interview with…Moisés Gómez Mateu

Moisés Gómez Mateu is a PhD student at the Universitat Politècnica de Catalunya (UPC), where he works as a research assistant.

Contact Moisés

1. Why do you like Biostatistics?

I have been very curious since I was a child; that’s why I like statistics. Moreover, statistics applied to biology and medicine helps people to improve their quality of life.

2. Could you give us some insight in your current field of research?

My thesis focuses on survival analysis, especially on the issue of composite endpoints in clinical trials. The main aim is to analyze which is the best primary endpoint to use, to extend the statistical theory, and to make practical tools available to researchers by means of an R library, an online platform, etc.

3. Did you find it difficult to move from the private sector to the University?

No. In fact, I left my job as a consultant in a marketing research company to study the MSc Statistics at the UPC, and I think it was a very good decision.

4. Which are, in your opinion, the main advantages of being a researcher?

It is very satisfying. The results you get and the research you conduct have nothing to do with the private sector. One usually investigates issues related to things you like, helping to improve science in general, not only to earn money.

5. What do you think of the situation of young biostatisticians in Spain?

The reality is that several colleagues and friends are working or studying abroad, or looking for opportunities …

6. What would be the 3 main characteristics or skills you would use to describe a good biostatistician?

Curiosity, Analytical skills and Creativity.

7. Which do you think are the main qualities of a good mentor?

Expertise, Modesty and Open-mindedness.

Selected publications:

• Gómez G, Gómez-Mateu M, Dafni U. Informed Choice of Composite Endpoints in Cardiovascular Trials. Submitted.
• Gómez G, Gómez-Mateu M. The Asymptotic Relative Efficiency and the ratio of sample sizes when testing two different null hypotheses. Submitted.

Featured

# Hierarchical models: special structures within repeated measures models.

Living creatures tend to organize their lives within structured communities such as families, schools, hospitals, towns or countries. For instance, students of the same age living in a town could be grouped into different classes according to their grade level, family income, school district and other features of interest. Other examples related with health care workers and patients show clear hierarchical data structures as well.
Hierarchical or nested structures (usually modelled with hierarchical linear models, HLM) are very common throughout many research areas. The study of this data pattern started in the field of social sciences. Most research studies in this area focused on educational data, where the main interest was to examine the relationship between inputs such as students, teachers or school resources, and student outcomes (academic achievement, self-concept, career aspirations…). Under this scenario, researchers emphasized that individuals drawn from an institution (classroom, school, town, …) will be more homogeneous than subjects randomly sampled from a larger population: students belonging to the same classroom share the same environment (places, district, teachers, …) and experiences. Due to this fact, observations based on these individuals are not fully independent.

As noted, hierarchical models account for the dependency among observations within the same study unit. Until recent decades, owing to the lack of suitable software, ordinary least squares regression (OLSR), i.e. classical regression, was used to estimate the aforementioned relationships. As a consequence, results obtained from OLSR show standard errors that are too small, leading to a higher probability of rejecting a null hypothesis than if (1) an appropriate statistical analysis were performed, or (2) the data included truly independent observations. Clearly, the main issue that researchers must address is the non-independence of the observations.
Hierarchical modeling is similar to OLSR: it can be seen as an extension of classical regression where at least two levels are defined in the predictive model. At the base level (also called the individual level, or level 1), the analysis is similar to OLSR: the outcome variable is defined as a linear combination of one or more level 1 explanatory variables:

$Y_{ij} = \beta_{0j}+\beta_{1j}X_{1} +\ldots+\beta_{kj}X_{k}+\epsilon_{ij}$

where $Y_{ij}$ is the value of the outcome variable for the $i$th individual of group $j$, $\beta_{0j}$ represents the intercept of group $j$, and $\beta_{1j}$ is the slope of variable $X_{1}$ in group $j$. At the subsequent levels, the level 1 slopes and intercept become dependent variables predicted from level 2 variables:

$\beta_{0j} = \delta_{00}+\delta_{01}W_{1} +\ldots+\delta_{0k}W_{k}+u_{0j}$
$\beta_{1j} = \delta_{10}+\delta_{11}W_{1} +\ldots+\delta_{1k}W_{k}+u_{1j}$

Through this process, it is possible to model the effects of level 1 and level 2 variables on the desired outcome. In the figure of this post, one can observe three main levels: patients (level 1) belong to hospitals (level 2), and hospitals, in turn, are located in certain neighborhoods (level 3).
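As a rough sketch of how such a two-level model (random intercept and slope per group) could be fitted in R with the lme4 package; the data are simulated and all names are made up for illustration:

library(lme4)

# Simulated data: 'outcome' and 'x1' measured on patients nested in hospitals
set.seed(1)
hosp.effect <- rnorm(10, sd = 1)   # hospital-level deviations from the overall intercept
df <- data.frame(hospital = factor(rep(1:10, each = 20)),
                 x1 = rnorm(200))
df$outcome <- 2 + 0.5 * df$x1 + hosp.effect[as.integer(df$hospital)] + rnorm(200)

# Random intercept and random slope for x1 across hospitals
fit <- lmer(outcome ~ x1 + (1 + x1 | hospital), data = df)
summary(fit)

# For comparison, the classical (single-level) regression ignoring the grouping
fit.ols <- lm(outcome ~ x1, data = df)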

This kind of modeling is essential to account for individual- and group-level variation when estimating group-level regression coefficients. However, in certain cases the classical and HLM approaches coincide: (1) when there is very little group-level variation, and (2) when the number of groups is small and, consequently, there is not enough information to accurately estimate the group-level variation. In these settings, HLM gains little over classical OLSR.

Now it is your turn: you know when it is worth the effort to apply HLM methods instead of classical regression.

Featured

# A pinch of OR in Biostatistics

The first thing Operational Research (OR) and Biostatistics have in common is that both terms are often misunderstood. OR, or the Science of Better as it is commonly known, has a lot to do with optimization, but there is much more to it… Some might be surprised to read that many of the tools of this discipline can be (and actually are) applied to problems in Biostatistics.

It all starts with Florence Nightingale volunteering as a nurse in the Crimean War in 1853… Her pioneering use of statistics for evidence-based medicine, and her operational research vision to reform nursing, led to a decline in the many preventable deaths that occurred throughout the nineteenth century in English military and civilian hospitals.

Since then, integrated approaches have become commonplace: disease forecasting models, neural networks in Epidemiology, discrete-event simulation in clinical trials, …

Figure: Genetic algorithm representation using R. Find here the code for the original dynamical plot.

There is also an increasing interest in computational biology within OR. Examples of applications of these techniques range from linear-programming-based flux balance analysis models for cell metabolism studies to sparse network estimation in the area of Genetics.

In R there are many packages with which to apply these techniques, from genalg for genetic algorithms (used to recreate the figure above) or games for game theory, to more specific Bioconductor packages like CellNOptR, survcomp or OLIN. A general Task View on Optimization can also be found here.
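As a small, self-contained sketch of what a genetic algorithm call looks like with genalg (this is not the code behind the figure above; the toy objective, a subset-sum problem, and all settings are arbitrary):

library(genalg)

# Toy problem: pick a subset of weights whose sum is as close to 25 as possible
weights <- c(2, 5, 7, 9, 12, 15)
target  <- 25

# genalg minimises evalFunc, so we return the absolute deviation from the target
evalFunc <- function(chromosome) {
  abs(sum(chromosome * weights) - target)
}

ga <- rbga.bin(size = length(weights),   # one bit per candidate weight
               popSize = 50, iters = 100,
               mutationChance = 0.05,
               evalFunc = evalFunc)

summary(ga, echo = TRUE)   # best solution(s) found
plot(ga)                   # evolution of the best and mean evaluation values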

Finally, a quick mention of the YoungOR 18 Conference that will be held in Exeter (UK), 9-11 April 2013. It will be covering topics of common interest to biostatisticians, with streams on Health, Sustainability, etc. Plenty of inspiration for future posts!

Have you ever used any of these techniques? Any particular tips that you want to share? Tell us about it!

Featured

# Biostatistics in a Health Services Environment

Aurora Baluja is an MD in the Department of Anesthesiology, Intensive Care and Pain Management of the Hospital Clinico Universitario of Santiago de Compostela, and a PhD candidate at the Forensic Medicine Unit of the University of Santiago de Compostela (Spain).

As a medical doctor involved in patient management (in the operating room and in the ICU), every day I witness the huge load of data that has to be processed, interpreted, and stored. In this situation, Biostatistics and Bioinformatics support becomes greatly important to analyse the trends and patterns that ultimately lead to improved patient care.

I do research on mortality risk profiles in the ICU and, given the amount of data obtained, the task of reporting my results and ensuring reproducibility became very time-consuming.

In this post I intend to comment on the main tools that helped me so much to overcome those obstacles:

•  R: for me, the possibility of learning a statistical computing language to write and recycle scripts was definitely the way to go. R is a powerful, open source programming language and software environment that allows a full variety of analyses and… beautiful graphs.
• RStudio: finding a good IDE for R code was also important. Fortunately, Altea recommended this easy-to-use, visual piece of cake to me, with integrated options to report code and results… Thanks!
• R Markdown is a quick, easy method to write scripts and to report clean code and results, with some formatting options, in an HTML page. It is implemented in RStudio via the knitr package. Just one click away, you can generate a complete, dynamic report with text and graphics.
•  LaTeX: the main reason for me to use LaTeX is not writing documents with beautiful symbols and equations, but to somehow link my R console to a document editor in order to generate tables and reports, just like I do with R Markdown. For this I need to use RStudio again, and four more ingredients:
1. A LaTeX distribution: I have installed TeX Live for Linux, and MiKTeX for MS Windows. A variety of templates, often included in distributions, makes the first approach quite easy. Beyond the use of Sweave (see below), I became so fond of LaTeX that now it is my favourite text editor for documents, posters and slides!
2. R Sweave is the link between R scripts and LaTeX, making it possible to write an entire LaTeX document with dynamic results from R commands embedded in it.
3. texreg: maybe the reason why I’ll never quit using LaTeX. Its magic begins after you have run several models on your data and you are trying to see and compare *all* of them at a glance. It generates LaTeX-formatted tables with your models, ready to paste into a LaTeX document (a minimal sketch is shown after this list). As a tip, I often use them preserving their {tabular} environments and customising the {table} options to fit my document style.
4. xtable: another R package that allows you to print nice tables in LaTeX format.
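A rough sketch of the texreg idea, using two toy models on a built-in data set (the call shown is the basic texreg() usage; options vary):

library(texreg)

# Two competing models fitted to a built-in data set
m1 <- lm(mpg ~ wt, data = mtcars)
m2 <- lm(mpg ~ wt + hp, data = mtcars)

# Side-by-side LaTeX table comparing both models; the output can be pasted
# (or written with the 'file' argument) into a LaTeX document
texreg(list(m1, m2), caption = "Comparison of two models", label = "tab:models")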

I encourage those who haven’t used any of these tools, to give them a try… surely they will help you!

Featured

# Measuring Disease Trends through Social Media

Nichole Knupp is a freelance blogger and marketing professional who writes guest posts and website content for a variety of different blogs and niches that interest her.

People in the field of Medicine, those in Master's in Public Health programs (see examples here and here) or in other disciplines such as Biostatistics (see previous post on the topic) might wonder how social media can play a valid role in their career.  While social media has taken the world by storm, it has mostly been thought of as a tool for sharing personal information, marketing, or keeping up on the latest gossip.  However, after this past flu season, those in the medical and scientific fields are finding a whole new way to use information that is trending.  Here are just a few of the things that social media is doing to benefit our health and sciences community.

Measuring Disease Trends
One of the terms used by those watching and tracking the spread of illnesses through social media is “social media analytics” (see, for instance, social network analysis).  Quite simply, it is the tracking of outbreaks of various viruses by mining social media.  This past season, people were following the spread of influenza through a flu hashtag (#flu).  Viruses such as H1N1 and the swine flu could be monitored on Twitter.  Further, it was found that when information was extracted efficiently, it was accurate, and there was even a possibility of forecasting further trends.  The most pertinent information, though, was what was happening in the present.  Different agencies were able to find which areas were the hardest hit and who was at risk.

Public Interest and Concerns
The health and sciences community also found social media to be a great platform in finding out how interested the public was in disease outbreaks and trends.  It was also a way to gather concerns so various agencies could address them.  Social media is utilized by people as a way to have their voices heard through an outlet that is part of their daily lives.  While some people are unwilling to sit down and answer a survey, important information can be gathered by what is mentioned on daily Twitter streams.

Disease Outbreaks and Travel
Since social media is a tool that is being used worldwide, it has also been helpful in tracking outbreaks abroad.  US citizens, and others traveling, can easily be informed through public social media announcements.  Currently there are conversations happening on how best to use the information gathered on social media, and what role it will play in informing the public and addressing questions and concerns. (A very interesting article on the topic can be found here).

Social Media Sources
While Twitter has been the main source used for tracking trends, Instagram, Facebook, and Google Trends also play a part.  If you are interested in seeing how it all works, some places to look are the HealthMap stream on Twitter (@healthmap), a Google Trends “flu” search, or a flu hashtag search (#flu) on Twitter.

It’s not hard to imagine the possibilities and the many other things we will be able to track in the future, but one thing is already certain: there is a use for social media if you work in the medical sciences, Public Health, Biostatistics, …

Featured

# Do we need Spatial Statistics? When?

Spatial statistics: what is it, and why use it?

The approach taken in this post is to offer an introduction to the basic concepts of spatial data analysis, as well as to the importance of its use in areas like epidemiology or ecology, among others. But before introducing a definition of spatial statistics and some concepts from this field, I consider it relevant to mention some situations in which our data need to be seen as ‘spatial data’.

People may associate spatial statistics with analyses that contain numerous maps. However, it goes beyond creating these; in fact, spatial data analysis depends on the internal structure of the observed data. We therefore have to be careful with questions that are not directly answered by simply looking at the data.

We could make a long list of areas where we can apply spatial statistics: epidemiology, agriculture, ecology, environmental science, geology, meteorology, oceanography,… even econometrics. In all of these we could ask questions like the following to recognize if our data have a spatial structure:

• Does the distribution of cases of a disease form a pattern in space? Could we relate health outcomes to geographic risk factors?
• Do environmental and geographical factors influence the distribution and variability of the occurrence of fishery species?

But how can we explain what spatial statistics is? I am sure we could find countless definitions of spatial analysis or spatial statistics. We can say that spatial statistics analyses the variability of random phenomena that are linked to their geographic locations. It is used to model the dependence of georeferenced observations (these can be point observations or areal data).

Depending on the type of data and the purpose of the spatial analysis itself, we classify spatial data sets into one of three basic types: geostatistical data, lattice data (or areal data) and point pattern data. The main difference between them lies in the set of locations where the observations arise. Let us look at this briefly with some examples:

• Geostatistics: geostatistical data are represented by a random vector $Y(s)$, where the location $s \in D$ varies continuously over a fixed observational region ($D \subset \Re^{r}$). These data are characterized by spatial dependence between locations, and the main objective of geostatistics is to make predictions at unobserved locations of the study region; kriging is the best-known prediction technique for geostatistical data (a minimal R sketch is given after this list). Some examples of such data are the occurrence of a species in a region, annual acid rain deposition in a town, etc.

• Lattice data: the fixed subset ($D \subset \Re^{r}$) is a continuous region (it can have a regular or irregular shape) partitioned into a finite number of geographical areas with well-defined boundaries. A characteristic of this type of spatial data is that neighbouring areas are usually more similar than distant ones. An example is the set of observations from agricultural field trials (here the plots form a regular lattice).
• Point pattern data: this is the last type of spatial data, in which $D \subset \Re^{r}$ is itself random. The locations themselves are determined by phenomena that occur randomly in space, thus generating spatial patterns. One example of point pattern data is the set of locations of a certain species of tree in a forest (here the locations themselves are thought of as random).
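As a rough sketch of ordinary kriging in R with the gstat package (using the meuse example data shipped with sp/gstat; the variogram model chosen here is only illustrative):

library(sp)
library(gstat)

# Example geostatistical data: zinc concentrations measured at point locations
data(meuse)
coordinates(meuse) <- ~ x + y
data(meuse.grid)
coordinates(meuse.grid) <- ~ x + y
gridded(meuse.grid) <- TRUE

# Empirical variogram and a fitted (spherical) variogram model
v <- variogram(log(zinc) ~ 1, meuse)
v.fit <- fit.variogram(v, vgm(psill = 1, model = "Sph", range = 900, nugget = 1))

# Ordinary kriging predictions at the unobserved grid locations
kr <- krige(log(zinc) ~ 1, meuse, meuse.grid, model = v.fit)
spplot(kr["var1.pred"], main = "Kriging predictions of log(zinc)")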

The above has only been a brief description of the three main types of spatial data. There are two basic methodologies for carrying out spatial analysis: classical statistics and the Bayesian approach (using mainly hierarchical Bayesian methods). The latter will be dealt with in more detail in future posts, given its importance and the many situations in which it can be applied, including spatial statistics (see the book ‘Hierarchical Modeling and Analysis for Spatial Data’ by Banerjee et al. for more information).

Featured

# The complex structure of longitudinal models

Two weeks ago we started to talk in this blog about longitudinal data, with the post by Urko Agirre. This type of data calls for models with a complex structure, called longitudinal models.

Longitudinal studies have two important characteristics:

1. They are multivariate because for each studied individual many temporal measurements from the response variable (and covariates) are collected.

2. They are multilevel as the variables measured are nested within the subjects under study, therefore resulting in layers.

These characteristics allow us to make inferences about the general trend of the population, as well as about the specific differences between subjects that may evolve differently from the overall average behaviour.

This type of data started to be modelled at the beginning of the 20th century. Different proposals appeared, such as ANOVA models (Fisher, 1918), MANOVA models (the multivariate generalisation of ANOVA models) or growth curves (Grizzle and Allen, 1969). All these proposals showed improvements in some aspects, but left others unresolved. On the one hand, ANOVA models are univariate and our data are multivariate. On the other hand, MANOVA models are multivariate but assume independence between intra-subject observations (and observations from the same individual are not independent in general). Finally, the last option, growth curves, allows for dependence between intra-subject observations but is too restrictive on the design matrix.

It was not until the early 1980s that a proposal covering all aspects of these complex data appeared: Laird and Ware proposed the application of linear mixed models (LMM) in the paper “Random-Effects Models for Longitudinal Data“.

The basic structure of the LMM for each patient $i$ is:

$y_i = X'_i\beta + Z'_ib_{i} + W_i(t_i) + \epsilon_i$

where

• $y_i=(y_{i1},\ldots,y_{in_i})$ is the vector of measurements of the response variable for the $i$th of a total of $m$ subjects, taken at times $t_i=(t_{i1},\ldots,t_{in_i})$; $n_i$ is the number of repeated measurements of the $i$th patient.

• $X'_i\beta$ represents the deterministic part of the model, $X'_i$ being the design submatrix of covariates associated with the $i$th individual and $\beta$ its associated parameter vector.

• $Z'_ib_i$ are random effects responsible for capturing the variability between individuals.

• $W_i(t_i)$ accounts for intra-individual variability, i.e. the serial dependence between observations of the same subject.

• Finally, $\epsilon_i$ reflects the residual variability that is not due to any systematic source we can identify.

To specify each of these elements, it is essential to first produce the descriptive graphics presented in this previous post.
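As a rough sketch of how such a model (random intercept per subject, plus serial correlation within subjects) could be fitted in R with the nlme package; the data are simulated and all names are made up for illustration:

library(nlme)

# Simulated longitudinal data: repeated measurements of 'y' over 'time'
# for each subject 'id'
set.seed(1)
df <- data.frame(id = factor(rep(1:30, each = 5)),
                 time = rep(0:4, times = 30))
df$y <- 10 + 0.8 * df$time + rnorm(30, sd = 2)[as.integer(df$id)] + rnorm(150)

# Random intercept per subject (Z'_i b_i) and an AR(1) serial correlation
# structure within subjects (one simple choice for W_i(t_i))
fit <- lme(y ~ time,                          # fixed effects: X'_i beta
           random = ~ 1 | id,                 # random effects: Z'_i b_i
           correlation = corAR1(form = ~ time | id),
           data = df)
summary(fit)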

We have learned a bit more about repeated measures, and in coming posts we will talk more about this topic, because it is only the beginning. To be continued!!!

Featured

# Mining genomic databases with R

When dealing with genomic data, retrieving information from databases is a required endeavor. Most of these repositories of genomic and proteomic data have very useful and friendly interfaces or browsers, perfectly suitable for particular searches. But if you are working with high-dimensional data, e.g. NGS data, more efficient tools to mine those databases are required. R offers several packages that make this task straightforward.

NCBI2R is an R package to annotate lists of SNPs, genes and microsatellites. It obtains information from NCBI databases. This package provides quite useful functions for getting a quick glance at a given set of SNPs or genes, retrieving SNPs in a gene or getting genes belonging to a KEGG pathway (among some other features) as shown in the code below.

library(NCBI2R)
GetIDs("MAPT[sym]")             # Entrez Gene ID(s) matching the MAPT gene symbol
GetGeneInfo(4137)               # annotation for the gene with Entrez ID 4137 (MAPT)
mysnps <- GetSNPsInGenes(4137)  # SNPs located in that gene
GetSNPInfo(mysnps)              # annotation for those SNPs
myset <- GetIDs("KEGG pathway:Alzheimer's disease")  # genes in a KEGG pathway


The biomaRt package, part of the Bioconductor project, is an R interface to the well-known BioMart data management system, which provides access to several data sources including Ensembl, HGNC, InterPro, Reactome and HapMap.

Most analyses can be performed with just two functions: useMart() to define the dataset, and getBM() to perform the query. Queries are built from attributes (the values we are interested in retrieving), filters (restrictions on the query) and a given set of values for the filters. Once we define the dataset we are interested in, we can check all the filters and attributes available. Let's say we want to look for the genes associated with an OMIM phenotype.

source( "http://www.bioconductor.org/biocLite.R")
biocLite("biomaRt")
library(biomaRt)
listMarts()
mymart <- useMart("ensembl",dataset = "hsapiens_gene_ensembl")
listFilters(mymart)
listAttributes(mymart)
getBM(attributes = c("entrezgene", "hgnc_symbol", "mim_morbid_description"), filters ="entrezgene", values = myset, mart = mymart)


org.Hs.eg.db, which is also part of Bioconductor, is an organism-specific package that provides annotation for the human genome, based on mappings that use Entrez Gene identifiers. In a very similar way to biomaRt, it allows querying the databases by means of the select() function, specifying cols (the kind of data that can be returned), keytypes (which of the columns can be used as keys) and a given key.

source("http://bioconductor.org/biocLite.R")
biocLite("org.Hs.eg.db")
library(org.Hs.eg.db)
keytypes(org.Hs.eg.db)
cols(org.Hs.eg.db)
select(org.Hs.eg.db, keys= "MAPT", cols = c("SYMBOL", "CHR", "CHRLOC", "UNIPROT"), keytype = "SYMBOL")


In many cases these three packages offer a good alternative to other well-known data mining systems such as the UCSC Table Browser data retrieval tool. Besides, the commands from these packages are quite simple, so even those not that familiar with the R language can benefit from them.

Just a quick tip to finish: I have just found out about the fread() function (data.table package) for reading large files. It really makes a difference!

Featured

# From Descriptive to Repeated Measures Data….one small step for studies, one giant leap for (bio)statistics

Traditional descriptive epidemiological studies, also called cross-sectional studies, are characterized by reporting population health, describing the existing distribution of the collected exposure factors (variables) without relating them to other hypotheses. In other words, they should try to give an answer to three basic “W” questions: who, where and when. The most important uses of this kind of research include health planning and hypothesis generation. Nonetheless, the most important pitfall is that researchers might draw causal inferences when conducting this type of study. Temporal associations between the exposure factors and the outcomes of interest might be unclear. Thus, when a researcher wants to verify a causal effect between two variables, a more appropriate design is highly recommended, such as a study with two or more observations per subject collected over the established research period. The latter design corresponds to a repeated measurements data structure and, more specifically, to longitudinal data analysis (a common form of repeated measures analysis in which measurements are recorded on individual subjects over time).

As mentioned in the previous paragraph, the main difference between the two research designs, cross-sectional and longitudinal, is that each experimental unit participating in the first one is observed only once, so for each exposure factor one has only one value per subject; in other words, each row in the dataset is an observation. In longitudinal data, however, each subject is observed more than once.

It is also worth pointing out the increase in the complexity of the statistical approaches when moving from descriptive analyses to repeated measures studies. In the first setting, the statistical methods in use are the simplest ones: mean and percentage comparisons by means of classical tests, regression analysis, etc. However, repeated measures data sets, and specifically longitudinal data analyses, require special statistical techniques for valid analysis and inference. Thus, researchers should be aware of three important points when building a proper statistical model, in this order: (1) the trend of the temporal component; (2) the variance-covariance structure; (3) the mean structure. More precisely, the overall trend of the evolution over time should be assessed first of all; temporal trends can follow a linear, quadratic, cubic or even a fourth-degree polynomial function. Besides, as observations from the same subject are likely to be correlated, repeated measures analyses must account for this correlation (the within- and between-subject effects must be controlled). Among the possible covariance structures, compound symmetry, unstructured and first-order autoregressive are the most used. As for the mean structure, the potential exposure factors which could be related to the dependent variable should be included in the model.

Longitudinal studies play a key role, mostly in epidemiology and clinical research. They are used to determine changes in an outcome measurement or to evaluate the effectiveness of a new treatment in a clinical trial, among other settings. In these scenarios, due to the complexity of the statistical analyses, longitudinal studies involve a great deal of effort, but they offer several benefits. The most important ones, from my point of view, are the following: (1) the ability to measure change in outcomes and/or exposures at the individual level, so that the researcher has the opportunity to observe individual patterns of change; (2) the temporal order of the exposure factors and the outcomes is measured, so the timing of the outcome onset can be related to the covariates.

Finally, there is no single statistical package dedicated to this kind of analysis; nowadays, most packages include in their recent releases the procedures needed to perform at least a basic longitudinal analysis. So there is no longer any excuse for not distinguishing a repeated measures/longitudinal analysis from a descriptive one, and for not carrying it out with confidence… Do you agree??

Featured

# Tricky bits: Ordinal data

From questionnaire response studies to plant-flowering stages and drug effect scoring analyses, researchers and biostatisticians often face situations in which some of the variables are not continuous but instead present fixed ordered categories.

The straightforward approach would be to deal with this data taking into account its full nature. However, in practice, when trying to do the analysis, many professionals still take a continuous (interval scale) approach in order to ensure that the most familiar statistical techniques can be employed.

Carifio and Perla claim that this debate on the use and “abuse” of the so-called Likert scales has been going on for over 50 years, and state that one of the main advantages of what they refer to as the “intervalist position” is the easy access to both traditional and more complex techniques based on it.

Other authors, such as Jamieson and Kuzon Jr. et al., advocate tailored procedures to deal with this kind of variable. Their argument is based on the need to reflect the ordered nature of the data, and on the difficulty of measuring distances between the different categories of a variable under a continuous approach.

Concerns regarding plotting this data have often been raised too. From pie to divided bar charts (see figure above and other examples here and here), it seems difficult to decide which form of visualisation is more appropriate and easily understandable.

Thankfully, the arrival of new graphical applications and user-friendly specialised software, like the R packages ordinal and MCMCglmm, is helping bring consensus closer and closer, so that we can all soon be speaking the same “ordinal language”.
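As a quick sketch of what an ordinal-specific analysis looks like in R, here is a cumulative link (proportional odds) model fitted with the ordinal package, using the wine data shipped with that package (the formula is only illustrative):

library(ordinal)

# 'wine' ships with the ordinal package: bitterness ratings (an ordered factor)
# for wines judged under different temperature and contact conditions
data(wine)

# Cumulative link model that respects the ordered categories of the response
fit <- clm(rating ~ temp + contact, data = wine)
summary(fit)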

For or against? I look forward to reading your views on this!!

Note: A highly recommended book on the topic is Agresti, A. (2010). Analysis of Ordinal Categorical Data (2nd ed), Wiley.

Featured

# 2nd Biostatnet General Meeting Review

With a marked focus on young researchers in particular and health-related Biostatistics in general, this 2nd Biostatnet General Meeting, held in Santiago de Compostela (Spain) on the 25th and 26th of January, was a fantastic opportunity for the Network's members to gather and discuss common topics of concern as well as success stories.

FreshBiostats bloggers participated actively and now want to make our readers witnesses of this stimulating event.

7 of the 8 Biostatnet main researchers

After the welcome and opening session chaired by Carmen Cadarso, focusing on presentations on the past activities of the Network by Emilio Letón, David Conesa, Inmaculada Arostegui, and Jordi Ocaña, a busy program of events was fitted into a day-and-a-half conference-like event:

Young researchers oral communications

Because of the meeting's high participation, oral communications by young researchers of Biostatnet were divided into three sessions:

• BIO session

The topics discussed in this first parallel session were the choice of primary end-points using a web application interface, by Moisés Gómez-Mateu; the modelling of non-proportional hazards regression, by Mar Rodríguez-Girondo; and randomization tests implemented in clinical trials, by Arkaitz Galbete. The second part of the session continued with two talks on Chronic Kidney Disease from two different approaches: the first, from a survival analysis (competing risks) point of view, was presented by Laetitia Teixeira, and the second, based on longitudinal analysis (Bayesian longitudinal models), was defended by Hèctor Perpiñán. Finally, Mónica López-Ratón presented her work on the estimation of generalized symmetry points for classification in continuous diagnostic tests. The session was moderated by Carles Serrat.

• STAT session

A varied array of talks was framed within the STAT session, which featured the interesting view of Joan Valls on the experience of the biostatisticians working at IRBLleida, two applications of Structured Additive Regression (STAR) models by Elisa Duarte and Valeria Mamouridis, a comparative analysis of different models for the prediction of breast cancer risk by Arantzazu Arrospide, an optimal experimental design application presented by Elvira Delgado, and a simulation study on the performance of the Beta-Binomial SGoF multi-testing method under dependence.

• NET session

In this third parallel session, topics such as “bio” research as well as others related to design of experiments were covered. Irantzu Barrio started with a talk on development and implementation of a methodology to select optimal cut-points to categorise continuous covariates in prediction models. Also in this session, Mercedes Rodríguez-Hernández presented her work on D-optimal designs for Adair models.

Also covering “bio” topics, a talk on derivative contrasts in quantile regression was given by Isabel Martínez-Silva. María Álvarez focused afterwards on the application of the method of maximum combination when comparing proportions. The last two communications dealt with a cost-effectiveness study of treatments for fracture prevention in postmenopausal women, by Nuria Pérez-Álvarez, and the application of Generalised Additive Mixed Models for the assessment of temporal variability of mussel recruitment, by María P. Pata.

Congratulations to the happy winners!!

To conclude these three sessions, Moisés Gómez-Mateu and Irantzu Barrio, the two winners of the ERCIM'12 Biostatnet invited sessions, received their awards (see picture above).

Poster sessions

Two poster sessions were also included within the hectic program of the meeting, covering a wide range of topics, from the analysis of clinical and genetic factors (Aurora Baluja) to a collective blogging experience like ours (find it here).

As a courtesy to the young researchers participating in the meeting, Biostatnet's main researchers gave each of us The Cartoon Guide to Statistics, which definitely finds the fun side of Statistics (see the snapshot below for a nice example ;P). We are very grateful for this gift and promise to make good use of it, maybe by trying to convince those that are still skeptical about the enjoyable side of this amazing science!

Image extracted from “The Cartoon Guide to Statistics”

Roundtables

Throughout the meeting a total of 5 sessions of roundtables and colloquiums took place. Both professionals in the field of Biostatistics as well as young researchers participated and offered their views on different topics.

“Biostatisticians in biomedical institutions: a necessity?” was the first roundtable of the meeting, with contributions from Arantza Urkaregi, Llorenç Bardiella, Vicente Lustres, and Erik Cobo. They attempted to answer the question from their professional experience, and the answer was unanimously positive.

The colloquium “Genomics, Biostatistics and Bioinformatics” was chaired by Malu Calle and featured presentations from Pilar Cacheiro (one of our bloggers), Roger Milne, Javier de las Rivas, and Álex Sánchez. They emphasized the importance of bringing together biostatistics and bioinformatics in the “omics” era, and a vibrant discussion followed regarding the definition of both terms.

The “Young Researchers roundtable” also generated a refreshing discussion about opportunities for young researchers in Biostatistics. Again, two of our bloggers, Altea Lorenzo and Hèctor Perpiñán, were involved in the session along with Núria Pérez and Oliver Valero, with Moisés Gómez as moderator and Isabel Martínez as organiser. The main conclusions reached at this table were the need for young biostatisticians to claim their important role in institutions, the aspiration to access specialised courses in the field, and the importance of communication, collaboration, and networking.

On the second morning, another very important topic, “Current training in Biostatistics”, was presented by three professors, Carmen Armero, Guadalupe Gómez and José Antonio Roldán, who currently teach in Biostatistics master's and degree programmes offered by Spanish universities. Some interesting collective projects were outlined and will hopefully be implemented soon.

Plenary talk

We cannot forget the invited talk on “Past and Current Issues in Clinical Trials” by Urania Dafni, director of the Frontier Science Foundation-Hellas and Biostatistics professor and director of the Laboratory of Biostatistics of the University of Athens' Nursing School of Health Sciences. An overall view of this hot topic and of the importance of the presence of biostatisticians in the whole process of design and development of drugs was given by this reputed professional in the field. This session and the following discussion were moderated by Montserrat Rué.

Closing colloquium

Finally, a session on the future of Biostatnet and the different alternatives for development and improvement was chaired by Jesús López Fidalgo and María Durbán, with the collaboration of experts on international and national research project funding, Martín Cacheiro and Eva Fabeiro, and Urania Dafni, director of the Greek node of the international network Frontier Science.

From what was shown, it seems like the year ahead is going to be a very busy and productive one for the Network and its members. All that we have left to say is… We are already looking forward to the 3rd General Meeting!!

LONG LIFE TO BIOSTATNET!!!

Your comments on the Meeting and this review are very welcome, let's keep the spirit up!

Featured

# Interview with…Natàlia Adell

Natàlia Adell is a graduate in Statistics from the Universitat Politècnica de Catalunya. She also has a master's degree in Statistics and Operations Research from the same university. She worked at KantarMedia and at the Statistical Service of the Universitat Autònoma de Barcelona. At present, she works in the Statistical Assessment Unit of the Research Technical Services of the University of Girona.

+34 680778844

http://www.udg.edu/str/uae

1. Why do you like Biostatistics?

Because I like applied Statistics and if you can contribute to a good cause such as decreasing the number of illnesses, you will have all the right ingredients for good science.

2. Could you give us some insight in the work you develop at the Statistical Assessment Unit of the UdG´s Research Technical Services?

My main perception is that people need statisticians to help with a part of their research, studies… Statistics is a science that other scientists need and the Statistical Assessment Unit tries to provide it.

3. What were the main difficulties you found when setting up the unit?

The main difficulty was getting started. We had to organise the unit, establish all the procedures, and also let the community know about us. The most important thing I had was the support of all the people around me, who helped every time I needed it (and still do).

4. Is it possible to combine consultancy/advice and research?

Well, in our case, we dedicate ourselves just to the consultancy and giving advice because doing research is not the aim of the Statistical Assessment Unit. But it might be possible to combine both, because some doubts arise from research, and some questions need a research approach so they can be related.

5. What do you think of the situation of young biostatisticians in Spain?

I think  biostatisticians usually work alone, without the support of other statisticians and, in my opinion, it would be interesting to share knowledge with other biostatisticians. So I hope that BioStatNet and FreshBiostats will allow that!

6. What would be the 3 main characteristics or skills you would use to describe a good biostatistician?

Listening, communicating and having a deep knowledge of Statistics. If you have these three characteristics, you can be a good biostatistician.

7. What do you think are the main qualities of a good mentor?

I think the most important skill is to be organised, knowing the steps you need to take to achieve your goal. Explaining difficult techniques in a clear way will also be appreciated.

8. Finally, is there any topic you would like to see covered in the blog?

Sample size could be a theme of interest!

Selected publications:

• Adell N., Puig P., Rojas-Olivares A., Caja G., Carné S. and Salama A.A.K. A bivariate model for retinal image identification. Computers and Electronics in Agriculture. 2012; 87: 108-112. Epub 2012 June.
• Rojas-Olivares M.A., Caja G., Carné S., Salama A.A.K., Adell N. and Puig P. Determining the optimal age for recording the retinal vascular pattern image of lambs. Journal of Animal Science. 2012; 90(3): 1040-1046. Epub 2011 Nov 7.
• Rojas-Olivares M.A., Caja G., Carné S., Salama A.A.K., Adell N. and Puig P. Retinal image recognition for verifying the identity of fattening and replacement lambs. Journal of Animal Science. 2011; 89(8): 2603-2613. Epub 2011 Feb 4.
• Martínez-Vilalta J., López B.C., Adell N., Badiella L. and Ninyerola M. Twentieth century increase of Scots pine radial growth in NE Spain shows strong climate interactions. Global Change Biology. 2008; 14(12): 2868-2881.

Featured

# Another important R: Relative Risk

María Álvarez Hernández, BSc in Mathematics (University of Salamanca), is a PhD student in Statistics and Operations Research at the University of Granada, where she works with Professor Martín Andrés. Her line of research is framed within the statistical analysis of categorical data from contingency tables. Contact María

One of the common objectives of the Health Sciences is to compare the proportions of individuals with a feature of interest in two different populations, for which purpose it is usual to take two independent samples. This is the case when comparing the proportion of cures under two different treatments, or the proportion of patients in the groups with and without a particular risk factor. In such situations, the parameter of interest may be the difference between two proportions, but in the field of Medicine it is usually the ratio of the two proportions, i.e. the relative risk (R). Examples of this are clinical trials which evaluate the effectiveness of a new vaccine, studies comparing two binary diagnostic methods, studies comparing two different treatments, etc.

From an exact point of view, obtaining a confidence interval for R is computationally very intensive, requires specific computer programmes and is not feasible for moderately large sample sizes (Reiczigel et al., 2008). Hence researchers have devoted great attention to obtaining approximate confidence intervals and, although many different procedures have been proposed, these have not always been compared. Nowadays, there is a general consensus that the best procedure is the score method proposed by Koopman (1984) and by Miettinen and Nurminen (1985). Alternatively, other simpler methods have been proposed which work more or less well (Farrington and Manning, 1990; Dann and Koch, 2005; Zou and Donner, 2008).

One piece of research in which I am involved is to improve these methods and to suggest new ones that allow us to achieve a result closer to the exact one, without losing rigor in the process (Martín and Álvarez, 2012). But although the improvement may be at a theoretical level, what happens on the computational side?

From a practical point of view, statistical packages such as SPSS 20, Stata 12 or StatXact 10 also focus on the asymptotic case when obtaining confidence intervals for the relative risk, although in some of them the researcher can actually obtain the exact confidence interval (in some situations incurring a long computation time). In general, the methods used are based on the ideas of Miettinen and Nurminen (1985), where a standard normal distribution is assumed; of Katz et al. (1978), who applied the logarithmic transformation; and of Koopman (1984), with the reputed score method. Sometimes, as is the case with the StatXact software, the Berger and Boos correction can be applied because it reduces conservatism (it results in shorter confidence intervals).
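
As a minimal illustration of the asymptotic approach, the Katz et al. (1978) log-transformation interval can be computed by hand in a few lines of R; the function and the counts below are purely illustrative, and score-type intervals in the spirit of Koopman (1984) are available in contributed R packages:

katz_ci <- function(x1, n1, x2, n2, conf = 0.95) {
  # relative risk and standard error of its logarithm (Katz et al., 1978)
  rr <- (x1 / n1) / (x2 / n2)
  se <- sqrt(1 / x1 - 1 / n1 + 1 / x2 - 1 / n2)
  z  <- qnorm(1 - (1 - conf) / 2)
  c(estimate = rr, lower = exp(log(rr) - z * se), upper = exp(log(rr) + z * se))
}

katz_ci(18, 80, 9, 85)  # e.g. 18/80 cures with the new treatment vs 9/85 with the control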

The aim must be to obtain not only the best methods from a theoretical point of view, but also those that are more feasible when carrying out the explicit calculation and that involve shorter computation times.

Therefore, although the theory evolves, the routines programmed in statistical packages to make inferences -for example, about a measure of association like the relative risk- have not kept pace with other techniques, even though this is a priority for the health sector.

In short, we should not be content with the procedures currently implemented, and we should spare no effort on research that allows us to improve them quickly and easily.

Featured

# Introduction to Bayesian statistics

After starting the year with a post about the International Year of Statistics, this week we present our first post about Bayesian statistics. It has been written jointly by Hèctor Perpiñán and Silvia Lladosa.

Any researcher (particularly all of those working in the field of statistics) is aware of the two main approaches to this science: Frequentist or Classical statistics and Bayesian statistics. The main difference between the Bayesian and the Frequentist approaches is essentially a distinct interpretation of what probability means, and thus a different way of making inference.

The term Bayesian refers to Bayes theorem, originally stated by the Reverend Thomas Bayes (1702–1761) in one of his last papers, “An Essay towards solving a Problem in the Doctrine of Chances”, published in 1763 (note that this year is the 250th anniversary).

In this post, we focus on introducing the basic aspects that characterize the Bayesian framework. First of all, we give a brief and simple definition of the principal idea of Bayesian statistics: it quantifies and combines all the uncertainty in the problem (data, parameters, etc.) in probabilistic terms. Probability is therefore understood as a degree of belief.

The basic procedure of Bayesian methodology involves:

• Assigning an initial probability distribution, $\pi(\theta)$, to the model parameters ($\theta$), which quantifies all the relevant information about them. This distribution must be chosen before seeing the data (it cannot, by any means, be conditioned on them).

Bayesian statistics has often been criticized because the interpretation of the prior probability distribution in terms of ‘beliefs’ seems subjective. But this is far from reality: you can choose different types of priors, subjective (to be used when you have some information about the parameters) or objective (for situations where there is no information on them).

Although it has not been explicitly mentioned above, the fact that we can express our beliefs about the parameters by means of a probability density function is the result of considering the parameters as random variables. This is one of the biggest differences with respect to classical statistics, which treats parameters as fixed but unknown quantities.

• Choosing a probabilistic model that relates the random variables and the model parameters associated with the experiment. This allows us to express the information provided by the data, given the parameters, in probabilistic terms by using the likelihood function, $p(y|\theta)$.

The last step in this procedure is to apply Bayes theorem, to combine prior knowledge and new information to find the posterior probability distribution, $\pi(\theta|y)$, of $\theta$,

$\pi(\theta|y)=\frac{p(y|\theta)\pi(\theta)}{\pi(y)}\propto p(y|\theta)\pi(\theta)$ .

The posterior distribution is updated according to the data, i.e. the prior probability is transformed into the posterior by the new evidence provided by the data. We can say that “Today’s posterior is tomorrow’s prior”. This final distribution will allow us to calculate point estimates of the parameters, credible interval estimates, predictions, etc.
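
As a minimal sketch of this procedure, consider the conjugate Beta-Binomial case in R, where the posterior is available in closed form (the prior parameters and the data below are made-up numbers for illustration):

a <- 2; b <- 2            # prior pi(theta) = Beta(2, 2)
y <- 7; n <- 10           # data: 7 successes out of 10 trials

post_a <- a + y           # posterior pi(theta | y) = Beta(a + y, b + n - y)
post_b <- b + n - y

post_a / (post_a + post_b)               # posterior mean (point estimate)
qbeta(c(0.025, 0.975), post_a, post_b)   # 95% credible interval

curve(dbeta(x, a, b), 0, 1, lty = 2, ylab = "density")   # prior (dashed)
curve(dbeta(x, post_a, post_b), add = TRUE)              # posterior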

After this brief introduction to Bayesian methodology, we will continue in our next posts with: prior distributions, Bayesian hierarchical models, WinBUGS and much more.

We hope your fears are fading away and that you start using this powerful paradigm. Because, as a professor once told us: “Bayesian statistics is a way of life”.

Featured

# Happy New (International Statistics) Year!

With the start of the new year last Tuesday, it is now time to make resolutions and plans for the 12 months ahead. For scientists and especially for those who are involved in statistical matters, it will be a special one, since 2013 has been declared as the International Year of  Statistics by the American Statistical Association, the Institute of Mathematical Statistics, the International Biometric Society, the International Statistical Institute (and the Bernoulli Society), and the Royal Statistical Society. This year commemorates major events that were determinant for the evolution of Statistics. As mentioned on the International Statistics Institute website, “2013 will be the 300th anniversary of Ars Conjectandi, written by Jakob Bernoulli and considered a foundational work in probability…  and the 250th anniversary of the first public presentation of Thomas Bayes’ famous work.”

One of the FreshBiostats initial aims is to promote amongst young researchers, and within our limits, this science that is a complete unknown for many people, but plays at the same time a very important role in many other more popular fields -like Biology or Medicine in the particular case of Biostatistics.

It is with great pleasure that we find as one of the main objectives of Statistics2013 “nurturing Statistics as a profession, especially among young people”, and makes us very proud to be one of the participating groups supporting this and other also important goals. Hopefully, this initiative will contribute to the exponential trend that has been noticed in the interest of students towards this topic (you can find graphical representations of Harvard´s stat concentration enrollment here).

For further information, you can visit the website http://www.statistics2013.org/ and watch the launch video here.

We hope to make a significant contribution to this fantastic year, how will you take part in the celebration? Time is running out!!

Featured

# Dose Finding Experiments

In a past entry, I spoke about some issues in clinical trial design, explaining their structure and different phases. Now, I am focusing on a part of these trials, the first two phases, which can also constitute a complete experiment by themselves, i.e. dose-finding experiments.

The aim of a dose-finding experiment is the safe and effective administration of a drug in humans. When a new drug (or procedure) is under study, we want to determine a safe dose of the drug for application, but this dose should also be effective. A balance between these two goals, non-toxicity and efficacy, is required in clinical trials.

Ethical concerns become essential in these experiments, as in any experiment conducted on human beings, but especially because they are first-in-man studies, so the safety of the participants is the main worry. They also have a very small sample size, usually about 20 patients, which poses a problem for the statistical analysis.

In phase 1 trials, the target is the evaluation of the maximum tolerated dose (MTD), the highest dose level whose observed toxicity rate does not exceed a pre-established threshold. Depending on the risk of the experiment, a toxicity rate is fixed and the maximum dose level which does not exceed this toxicity rate is chosen. Then the recommended dose for the next phases of the study is either the MTD or one dose level below the MTD. In phase 2 trials, we have an analogous experiment, but now the target is the minimal effective dose, MED, which is the minimum dose level with a fixed efficacy rate. It is also common to try to combine these two targets in a single experiment, estimating a toxicity-efficacy curve and looking for an optimal dose that balances these two goals.

A wide catalogue of designs for dose-finding experiments can be found in the statistical literature. The initial dose, the dose escalation, the stopping point and the accuracy in the estimation of the MTD and MED are the main concerns in a design, and they are still a fertile area for research. A classical design in phase 1 is the traditional 3+3 design, widely used in oncology experiments. In this design, patients are assigned in groups of three and the trial starts at the lowest dose level. The first three patients are assigned and, if none of them shows any toxicity, we assign the next patients to the next level; if there is one case of toxicity, we repeat the experiment at the same level, and if there are two or more toxicities at the same level, we conclude that we have exceeded the MTD. This procedure is repeated until we exceed the MTD.
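
A minimal sketch of this escalation rule can be simulated in R; the true toxicity probabilities below are hypothetical, and the convention of escalating after a single toxicity in six patients is one common variant of the design:

simulate_3plus3 <- function(true_tox = c(0.05, 0.10, 0.20, 0.35, 0.50)) {
  level <- 1
  repeat {
    tox <- rbinom(1, 3, true_tox[level])                       # first cohort of three patients
    if (tox == 1) tox <- tox + rbinom(1, 3, true_tox[level])   # one toxicity: three more patients
    if (tox >= 2) return(level - 1)            # two or more: MTD exceeded, recommend one level below
    if (level == length(true_tox)) return(level)   # no higher levels left
    level <- level + 1                             # escalate to the next dose level
  }
}

set.seed(1)
table(replicate(1000, simulate_3plus3()))  # distribution of the recommended dose level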

Phases 1 and 2 have received much less theoretical attention than phase 3 of clinical trials, but with this post I wanted to highlight their importance and try to make them more understandable to people who work in statistics.

Featured

# Handling multiple data frames in R

I am not -yet- that highly skilled at programming in R, so when I run into a function/package that really meets my needs, well, that is quite a big deal. And that is what happened with the plyr package.

I often have to work with a high number of data frames. It is not about “big” statistics here, but just some basic descriptives, subsettings…  It has to do more with handling this amount of data in an easy and efficient way.

There are two routines I find essential when manipulating these data sets. The first one is being able to operate with all these data frames at once (e.g. subsetting for filtering) by creating lists.

So let´s say we have a certain number of data frames, file_1, file_2, … each of them with the same variables named  var1, var2,… and want to subset all of them based on a certain variable.

library(plyr)
# read all the files matching the pattern into a list of data frames
# (read.csv is assumed here; use the reader appropriate to your files)
file_names <- list.files(pattern = "file_")
list_dataframes <- llply(file_names, read.csv)

dimensions <- ldply(list_dataframes, dim)
filter <- llply(list_dataframes, subset, var1 == "myvalue")
selection <- llply(list_dataframes, subset, select = c(var1, var3))


No need for “for” loops here! It is certainly much neater and easier this way. More information about llply, ldply or laply can be found in the plyr R tutorial. Much has been said about its advantages in other blogs; you can check it here or in my “indispensable” gettinggeneticsdone.

The second one allows us to identify common values between those data frames. The first function that comes to mind is merge. Again, there are several useful posts about it (gettinggeneticsdone, r-statistics) and it certainly serves the purpose in many cases. But quite frequently you find yourself in the situation where you have several data frames to merge; merge_all and merge_recurse in the reshape package overcome this problem. There is an excellent R wiki entry covering this topic.
As an alternative to merge, join (again in the plyr package) lets you specify how duplicates should be matched.
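
A minimal sketch of both options, on three made-up data frames sharing an id column:

library(reshape)   # merge_all, merge_recurse
library(plyr)      # join

df1 <- data.frame(id = 1:5, var1 = rnorm(5))
df2 <- data.frame(id = 3:7, var2 = rnorm(5))
df3 <- data.frame(id = 2:6, var3 = rnorm(5))

merged <- merge_recurse(list(df1, df2, df3))   # merge any number of data frames in one call
joined <- join(df1, df2, by = "id", type = "left", match = "first")   # control how duplicates are matched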

Note that both packages, plyr and reshape, are developed by Hadley Wickham, ggplot2´s creator.

These functions have become part of my daily routine and they definitely save me a lot of trouble. I have yet to explore another package I read great things about: sqldf.

Do you have any other suggestions on manipulating data frames?

Featured

# Big data analysis: a new ongoing challenge

Science, like other areas of knowledge, has had its own evolution: before 1600, it was referred to as empirical science. The second period, from 1600 to approximately 1950, is called the theoretical period, where each discipline developed theoretical models which often motivated experiments and broadened our understanding. Afterwards came the computational period, and over some 40 years these disciplines grew a computational branch based on simulations to find solutions to complex mathematical models. Since 1990, and after the spread of computers all over the world, a new period has started: as technology advances, the size and number of experimental data sets are increasing exponentially, mainly thanks to the ability to economically store and manage petabytes (more than terabytes!) of data and to their easy accessibility. For instance, in modern medicine, there is now a well-established tradition of depositing scientific data into a public repository, and also of creating public databases for use by other scientists: researchers collect huge amounts of information about patients through imaging technology -CAT scans-, DNA microarrays, etc. This is what we call big data.

But… what about the process of analyzing this kind of data? How do we handle this amount of data? As a biostatistician, at first glance it does not seem like a complex problem to manage. Essentially, the main procedure is based on the extraction of interesting (non-trivial, previously unknown and potentially useful) patterns from the big data set. This is what is called data mining or the machine learning process. This technique was initially applied in business (e.g., to identify the profile of customers of a certain brand) but nowadays the impact of data abundance extends well beyond this discipline.

There are several reasons that support the usage of huge data sets, but mainly: (1) they allow us to relax assumptions of linearity or normality of the variables collected in the databases; (2) we can identify rare events or low-incidence populations; (3) data analysts can generate better predictions and a better understanding of effects.

As for the functionalities of the data mining technique, there are numerous patterns to be mined, but I will focus on those which are most relevant and applied in the biomedical sciences: classification, cluster analysis and outlier analysis. In the first one, data analysts construct models to describe and distinguish classes or concepts for future prediction. Typical methods within classification are the well-known decision trees or logistic regression models. In contrast, cluster analysis groups data to form new categories in order to find distribution patterns, maximizing intra-class similarity and minimizing inter-class similarity at the same time. In future posts I will give a more extensive explanation of certain methods belonging to these functionalities…
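
As a minimal sketch of the first two functionalities, using R's built-in iris data purely for illustration:

library(rpart)

# classification: a decision tree to describe and distinguish classes
tree <- rpart(Species ~ ., data = iris, method = "class")
print(tree)

# cluster analysis: k-means grouping observations into new categories
set.seed(123)
clusters <- kmeans(iris[, 1:4], centers = 3)
table(clusters$cluster, iris$Species)   # compare the clusters with the known classes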

Finally, it must be pointed out that this kind of data set can be analyzed by means of the most widely used statistical software (e.g., SPSS, SAS or R) and, although some operations may be CPU intensive, it is definitely worth the effort to find a solution to this ongoing challenge: big data analysis.

Are you prepared to face it?

Featured

# Invitation to the Biostatnet 2nd General Meeting

Hi everyone!

As active members of the National Biostatistics Network Biostatnet, we would like to invite all the young (and not so young) members to participate in the Biostatnet 2nd General Meeting that will take place the 25th and 26th of January 2013 in Santiago de Compostela. You can find the program of the event here and can request any further information at biostatnet@gmail.com.

Apart from some really interesting talks and roundtables, this meeting has a main focus on young researchers, and as such we believe it is particularly important that we participate and try to get involved in this amazing network that is bringing us closer and closer despite the physical separation.

Remember!! The deadline for submission of posters and oral communications is the 14th of December, so HURRY UP!!

FreshBiostats will be represented in the roundtable that will be held on the 25th so it would be fantastic to get your comments as to which topics you would like to see discussed or any issues that affect you directly. Thank you for your collaboration!!

Featured

# B for Biology: Not just counting sheep

As some of my co-bloggers have mentioned before, Biostatistics has been closely associated lately with studies in the health sciences and has somehow forgotten the wider biological side of things. I will be focusing today on ecological and environmental matters and the statistical approach to this kind of problems.

According to Smith (Ecological Statistics; Encyclopedia of Environmetrics, Vol. 2, pp 589-602; John Wiley & Sons, 2002), ecological Statistics can be defined as the area of Statistics that focuses on ecological problem solving, where Ecology can be understood as the scientific study of the distribution and abundance of organisms. It will cover, therefore, “sampling, assessment, and decision making for both policy and research” (Patil, G.P., Environmental and Ecological Statistics; Encyclopedia of Environmetrics, Vol. 2, pp 672-674; John Wiley & Sons, 2002), and will require advanced techniques to ensure the correct modelling of complex univariate and multivariate relationships (often nonlinear) from both spatial and temporal perspectives.

To fully understand this field of study, we would initially need to make a clear distinction between single species and multispecies analysis, two diametrically opposed approaches calling for different statistical strategies.

The former is mainly based on measurements of species abundance and performance (survival, growth, and recruitment). As such, it encounters an old dilemma: how to keep observational bias to a minimum? Petersen and transect methods are used in wildlife censuses to avoid this and other biases, and advanced methodology like mixed models, flexible regression techniques, spatial and temporal statistics, and Bayesian inference is applied in the analysis itself.

Multispecies analysis on the other hand, deals with the complicated interactions and dependencies existing in the various ecosystems. The notions of diversity – measuring global changes in different species as a community, and mostly criticized for the potential lack of ecological relevance of some of the measures – and integrity – metrics accounting for a certain ecosystem unimpaired state; read this interesting article for further discussion on the difference between health and integrity –  are its main pillars. Multivariate analysis of ecosystems includes methods like correspondence analysis and redundancy analysis, amongst many others.
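
As a minimal sketch of these two strands in R, using the vegan package and its bundled dune data set purely for illustration:

library(vegan)

data(dune)                           # species abundance matrix (sites x species)
diversity(dune, index = "shannon")   # Shannon diversity per site
ord <- cca(dune)                     # correspondence analysis of the community matrix
plot(ord)                            # sites and species in the ordination space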

It is worth noticing that the latter is nowadays a major focus of research, as a direct consequence of increasing public awareness of the need to preserve endangered ecosystems in order to ensure the whole planet´s good health.

In the end it is not just about counting sheep but how to count them, ensuring representativity, and considering issues like diversity and integrity in their relationships with other species.

Note: as a proof of the importance of this broad area on its own, there is a multidisciplinary journal dedicated to the topic, “Environmental and Ecological Statistics”, and an exhaustive R Task View called “Analysis of Ecological and Environmental Data” available here. “Analyzing Ecological Data” (2007) by Zuur, Ieno & Smith is also highly recommended.

Have you faced any of these problems? Any tips? Many thanks for your comments!!

Featured

# Biostatistics software review

Nowadays, most of us would not be able to perform our daily job without software. It is therefore essential to choose the right one because, whether we want it or not, it will become our (sometimes hated, most times loved) closest companion.

Thanks to the fast development of technology and trying to obtain an answer to more complex biomedical problems, several software manufacturers have produced statistical packages oriented to different fields of Statistics.

In this post we intend to give an overview of some of the software available and in use in biostatistical research by classifying them in three main categories, i.e, general use, specialized and tailored alternatives.

General use

S-Plus/R

S-Plus and R are both statistics and programming environment software. They provide the opportunity of coding customized data analyses using a high-level programming language. It can be said that R and S-Plus are quite close, since they speak the same dialect – the code is the same – and consequently syntax written for one can be used under the other platform without any change. On the other hand, the most remarkable difference between both programs is that R is GNU-licensed software, that is, it is free and can be accessed and adapted to suit each researcher's data analysis requirements.

Among the multiple R user-friendly interfaces available, we would highlight the following:

• RStudio is a free and open source integrated development environment for R that can run under Windows, Linux, Mac or even over the web using RStudio Server. As a special feature, it is organized into four different work areas: the console for interactive R sessions, a tabbed source-code editor to organize a project’s files, another frame with the workspace as well as a history of the commands that you have previously entered, and finally a frame that provides an easy administrative tool for managing packages, files, plots and help.
• R Commander´s main advantage would be the fact that it does not require downloading the interface itself. You can access it by simply calling the package Rcmdr from your R console, and it allows for both option selection and coding. However, it is somewhat limited in the choices available for selection.
• RKWard is meant to be easy to use, making R programming easier and faster by providing a graphical front-end that can be used by users inexperienced in the R language as well as by experts. Like RStudio, it can be run under Windows, Linux and Mac; unlike R Commander, it cannot be loaded from within an R session but has to be started as a stand-alone application.
• Deducer is another graphical user interface (GUI) for R that avoids the hassle of programming. Amongst its outstanding features, we would highlight its plot builder tool with multiple customisation options.

As a particular application of R it is worth mentioning one widely used in the analysis of genomic data:

• Bioconductor, with more than 600 R packages, is focused on the analysis of high throughput genomic data including analysis of microarrays and dealing with sequence data or variant files such as those generated by Next Generation Sequencing projects.

SAS

Statistical Analysis System (SAS) is an integrated software package which allows one to program tasks such as statistical analyses, reports of results, operational research studies or quality improvement. Though it is oriented mostly to business or insurance enterprises, SAS has become an important tool in biomedical research in recent years. It must be pointed out that the code is based on the PL/1 language.

IBM SPSS

Although mainly used in the Social Sciences field, this software is often chosen by professionals in the area of Biomedicine for its ease of use and attractive graphics.

STATA

STATA (Statistics+data) is another well-known package for data analysis. It was created in 1985 by StataCorp and its use is focused mostly on business or epidemiology research. For the current version details, go here .

The above-mentioned statistical packages are the most used in our field. But many times the statistical analysis calls for specific software to obtain a solution to our problem. Other software that might fit more specific needs is detailed below.

Specialized

WinBUGS

WinBUGS is statistical software for analyzing complex Bayesian probability models using Markov chain Monte Carlo (MCMC) methods. This software is part of the BUGS (Bayesian inference Using Gibbs Sampling) project. It was created to run under Microsoft Windows as an independent program, but it is possible to access it through the package R2WinBUGS from R.

There is another version of WinBUGS called OpenBUGS, an open-source version of the package, which can be called from R (with the package R2OpenBUGS) and SAS, amongst others. Another open-source alternative to WinBUGS is JAGS (Just Another Gibbs Sampler), which can be accessed from R via the R2jags or rjags packages.
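
As a minimal sketch of that route (assuming JAGS is installed on the system), a toy Beta-Bernoulli model fitted with R2jags; the model and the data below are illustrative only:

library(R2jags)

model_string <- "
model {
  for (i in 1:N) { y[i] ~ dbern(theta) }
  theta ~ dbeta(1, 1)
}
"
model_file <- tempfile(fileext = ".txt")
writeLines(model_string, model_file)

y <- c(1, 0, 1, 1, 0, 1, 1, 1, 0, 1)   # made-up binary outcomes
fit <- jags(data = list(y = y, N = length(y)),
            parameters.to.save = "theta",
            model.file = model_file,
            n.chains = 3, n.iter = 2000)
print(fit)   # posterior summary of theta from the MCMC samples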

MLwiN

It is an important package for fitting multilevel models, developed at the University of Bristol. Its main feature is an equation window where one can write the model with the parameters to be estimated.

Mplus

The general modelling approach of Mplus is to describe the collected data by means of latent variables and path diagrams. Thus, the statistical techniques mainly used are exploratory and confirmatory factor analysis, path analysis, and  hierarchical models.

Tailored alternatives

• EpiLinux is an operating system especially orientated towards those professionals, researchers and students working in the areas of Epidemiology, Biostatistics, and health studies in general. EpiLinux 3 is based on GNU/Linux Ubuntu 12.04 LTS with Lightweight X11 Desktop Environment (LXDE) and is a joint project of the Dirección Xeral de Innovación e Xestión da Saúde Pública de la Xunta de Galicia and the Biostatistics Unit of the Universidad de Santiago de Compostela. For further information and download, visit the following website.
• BioStatFLOSS , similarly to EpiLinux but restricted in this case to Windows operating system, gathers programs specifically designed for the implementation of epidemiologic, biostatistical and health studies in general. Its major advantage is the fact that no installation is required. You can download it here.
• Epidat is a free user-friendly programme developed by the Servizo de Epidemioloxía de la Dirección Xeral de Innovación e Xestión da Saúde Pública de la Consellería de Sanidade (Xunta de Galicia) with the institutional support of the Organización Panamericana de la Salud (OPS-OMS) and purposefully built for the analysis of epidemiologic data. More information can be found here.

All these tools will definitely make your life as a  biostatistician so much easier, but now it is your choice!! You could even keep on doing your number crunching by hand :-)

Featured

# Invitation to the XIV Spanish Biometric Conference 2013

Elvira Delgado Márquez, MSc in Applied Statistics, BSc in Computer Engineering and BSc in Statistics (University of Granada) is a PhD student at the University of Castilla-La Mancha where she works with Professor López Fidalgo and Dr. Amo Salas. Her area of expertise is Optimum Experimental Designs.

-

The term “Biometry” has been used to refer to the field of development of statistical and mathematical methods applicable to data analysis problems in the biological sciences. Statistical methods for the analysis of data from agricultural field experiments to compare the yields of different varieties of wheat, for the analysis of data from human clinical trials evaluating the relative effectiveness of competing therapies for disease, or for the analysis of data from environmental studies on the effects of air or water pollution on the appearance of human disease in a region or country are all examples of problems that would fall under the umbrella of “Biometrics” as the term has been historically used.

Recently, the term “Biometrics” has also been used to refer to the emerging field of technology devoted to the identification of individuals using biological traits, such as those based on retinal or iris scanning, fingerprints, or face recognition. Neither the journal “Biometrics” nor the International Biometric Society are engaged in research, marketing, or reporting related to this technology. Likewise, the editors and staff of the journal are not knowledgeable in this area.

On behalf of the Spanish Biometric Society, the area of Statistics and Operations Research at the University of Castilla – La Mancha welcomes the celebration of the XIV Spanish Biometric Conference – 2013 that will be held in Ciudad Real (Spain), from the 22nd to the 24th of May, 2013.

Full information can be found at the Conference´s website  as well as contacting the following e-mail address: biometria2013@gmail.com

We invite scholars willing to promote the development and application of mathematical and statistical methods in the areas of Biology, Medicine, Psychology, Pharmacology, Agriculture, Bioinformatics and other areas related to the life sciences, to come to Ciudad Real and participate in the presentation of the latest results in these areas.

Furthermore, the Biometrical Journal (edited in cooperation with the German and the Austro-Swiss Region of the International Biometric Society), indexed in Journal Citation Reports (JCR), will publish a special issue with a selection of the papers presented at the conference.

We remain at your disposal and we look forward to welcoming you in Ciudad Real very soon.

Elvira Delgado on behalf of the organizing committee.

Featured

# Appearances can be deceiving

Anabel Blasco, BSc in Statistical Techniques and MSc in Statistics and Operations Research (Universitat Politècnica de Catalunya), and MSc in Mathematics for Finance (Universitat Autònoma de Barcelona), works as statistical consultant and training area coordinator at the Servei d´Estadística Aplicada of the Universitat Autònoma de Barcelona. Contact Anabel

I’m a statistical consultant. In the course of my job, I have advised many applied researchers, from botanists to andrologists, and performed many different statistical analyses, from a simple t-test to more sophisticated analyses which are resolved through advanced statistical modelling. In order to evaluate the needs of researchers, I find it necessary to meet them and let them explain the study goal, show the available data and detail their statistical doubts. After the meeting, I usually know what kind of analysis is required.

At this point, I think we should not underestimate any study, regardless of what it may seem at first sight; doing so is a serious mistake. Let me explain.

As a statistical researcher, I like to work with data that test my analytical abilities while trying to extract the most from them. However, a high-level analysis is not always required; sometimes the simplest analysis satisfies the researcher's needs and expectations. But sometimes, some seemingly harmless data conceal a sophisticated statistical analysis that had initially gone unnoticed.

Some months ago, I had a meeting with two biologists. Their study dealt with the predation of a certain type of plant by some insects in different regions. They had tried to use a simple ANOVA test to compare the number of plants affected by predation among regions, but the test did not give statistically significant results. A statistician quickly realizes what is wrong: “Maybe you are not taking into account the variability among regions and, of course, you don’t have normal data because you are dealing with counts”.

Homogeneity of variances and normal distribution are two important hypotheses in the ANOVA test. To solve the problem of non-constant variances, different alternatives are possible, for example using transformations. The most common data transformations are those proposed by Anscombe (1948) and the Box-Cox transformations (1964). These transformations not only solve the problem of non-homogeneity, but they also reduce data anomalies such as non-additivity and non-normality. Transforming the data is a good solution, but we can go even further. In 1972 John Nelder and Robert Wedderburn formulated the generalized linear model (GLM), a flexible generalization of the linear regression model allowing for response variables having distributions other than the normal.

Since we were evaluating counts, a GLM using the Poisson distribution could be applied. The result remained the same: statistically non-significant differences in predation counts among regions. We had started with ANOVA, then transformed the data obtaining variables with theoretically nice properties, estimated a GLM with Poisson distribution and, at the end, we were at the same point. There was something wrong. In fact, there was a subtle difference among regions: one of them had many more zero counts than the other regions. These zeros could be treated in a more appropriate way.

The answer to this problem appeared in the nineties: zero-inflated Poisson models. These models are a way of dealing with overdispersion. The model assumes that the data are a “mixture” of two sorts of individuals: one group whose counts are generated by a standard Poisson regression model, and another group that only produces zero counts. Thus, this approach can take into account the excess of zero counts. Therefore, a zero-inflated Poisson model (ZIP) was proposed to solve our problem. Moreover, in this setting, not only a Poisson but also a Negative Binomial distribution can be assumed (ZINB). This led me to further investigation, comparing ZIP and ZINB models with GLMs with Poisson and NB distributions by using appropriate tests. The decision of using one model or another can be made not only from a statistical point of view but also using the biological interpretation. In this case, we saw that a ZINB model could describe not only the count process for the predation data but also the process generating the zero predation.
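
A minimal sketch of this model comparison in R, using the pscl package and simulated data that stand in for the predation counts (all values below are made up for illustration):

library(pscl)   # zeroinfl(), vuong()
library(MASS)   # glm.nb()

set.seed(42)
region <- factor(rep(c("A", "B", "C", "D"), each = 50))
lambda <- c(A = 3, B = 3.5, C = 3, D = 3)[as.character(region)]
zero_excess <- rbinom(200, 1, ifelse(region == "D", 0.6, 0.05))   # region D has extra zeros
count <- ifelse(zero_excess == 1, 0, rpois(200, lambda))
predation <- data.frame(region, count)

glm_pois <- glm(count ~ region, family = poisson, data = predation)
glm_nb   <- glm.nb(count ~ region, data = predation)
zip_fit  <- zeroinfl(count ~ region, dist = "poisson", data = predation)
zinb_fit <- zeroinfl(count ~ region, dist = "negbin", data = predation)

AIC(glm_pois, glm_nb, zip_fit, zinb_fit)   # compare the four fits
vuong(zip_fit, glm_pois)                   # Vuong test: zero-inflated vs standard Poisson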

The lesson of this story is that sometimes a seemingly simple study can hide the most sophisticated analysis. Never underestimate the difficulty of a simple experiment, because appearances can often (very often) be deceiving.

By Anabel Blasco

Statistical Consultant

Featured

# Interview with…Isabel Martínez Silva

Isabel Martínez Silva is a researcher, statistical consultant and PhD candidate at the Biostatistics Unit of the University of Santiago de Compostela. Contact Isabel

1. Why do you like Biostatistics?

I find Biostatistics particularly interesting in the sense that not only does it allow you to learn about Statistics, but it also gives you the opportunity to cooperate with professionals that require our statistical knowledge for their research in the Bio sciences (Medicine, Biology, Odontology, Veterinary science, etc.). It is true that there is a need for advances in Statistics and we work on that in our research projects, but it is also essential to share our knowledge and train professionals from other fields in the latest statistical techniques in order to provide better and more accurate results for their research work and to promote interdisciplinarity, which is crucial for the improvement of any discipline.

2. Could you give us some insight in your current field of research?

My PhD work focuses on smoothed quantile regression and its applications in Biomedicine.

One of the best-known examples for the general public of the applications of the technique in this field would be the study of growth curves. In general, every child´s growth is followed up by their paediatrician from birth. In these check-ups, measurements of weight, height, and age are taken, which allow them to monitor the growth of the infant population. The need for smoothing in this case, as well as the differences between boys and girls, is patent.

My latest research in this area was presented at the JEDE II conference last July, and focuses on quantile regression hypothesis testing. We basically wonder whether boys´ and girls´ growth distributions and percentiles are actually different. If the distributions were not different, we would not need to calculate different percentiles for each sex, and if they were, it would not necessarily mean that the percentiles have to be different. In order to answer these two questions, bootstrap hypothesis testing has been applied, which allows us to assess the statistically significant differences both between the distributions and between each of the percentiles by sex.

3. Do you find it difficult to combine research and advice in Biostatistics?

Yes, I think it is particularly difficult, mainly because of the inflexibility of the system and the centres´ internal bureaucracy. For instance, in the medical environment, Biostatistics is usually understood as part of Epidemiology, and in the statistical world, Biostatistics is also considered a subset of Statistics. I personally believe both notions are incomplete. Biostatistics starts within the frontiers of Statistics but then crosses them when complemented with contributions from Epidemiology that do not have a place within purely mathematical subjects. Furthermore, in modern Biostatistics the use and creation of specific software for the implementation of statistical techniques is indispensable, and this is something outside the aims of Epidemiology. From my point of view, all these facts position biostatisticians within Statistics, but always building bridges with the Bio environment, to whom they must listen and try to understand so as to give value to the appropriate statistical techniques for each particular study.

4. What would be the 3 main characteristics or skills you would use to describe a good biostatistician?

Statistics, Computing, and interdisciplinarity.

5. What do you think of the situation of young biostatisticians in Spain?

I believe it is very complicated and is mainly centred around universities. From my point of view, Biostatistics is nearly absent in Spain´s private sector and its presence in research centres and/or public foundations is unequal. Adding this to the state of the current Spanish market makes the future of young biostatisticians outside the university really tough, contrary to what happens in Europe and the US.

6. Which do you think are the main qualities of a good mentor?

Accessible, motivational, and innovative.

7. Finally, is there any topic you would like to see covered in the blog?

I find that it has covered a wide range of areas for the very short time that has been going on. Congratulations, you are doing a great job!!

Selected publications:

• Martínez-Silva I., Lustres-Pérez V., Lorenzo-Arribas A., Roca-Pardiñas J., Cadarso-Suárez C. Flexible quantile regression models: application to the study of the sea urchin, Paracentrotus lividus (Lamarck, 1816). SORT (Under review).
• Carballo-Quintás M, Martínez-Silva I, Cadarso-Suárez C, Álvarez-Figueiras M, Ares- Pena FJ, López Martín E. A study of neurotoxic biomarkers, c-fos and GFAP after acute exposure to GSM radiation at 900 MHz in the picrotoxin model of rat brains. Neurotoxicology, 32 (4),   pp:478-494 , August 2011. D.O.I.: http://dx.doi.org/10.1016/j.neuro.2011.04.003.
• Cubiella Fernández J. , Núñez Calvo L. , González Vázquez E. ,  García García M. J. , Alves Pérez M. T. , Martínez Silva I. , Fernández Seara J. Risk factors associated with the development of ischemic colitis. World J Gastroenterol 16(36), pp. 4564-4569. September 2010.
Featured

# Who can be a biostatistician?

Nowadays, Statistics and more specifically Biostatistics is increasingly becoming an important and essential tool in the area of scientific and technical research for everyone who works in very diverse contexts linked to human health, ecology, environment, agriculture, etc.

With the new advances in technology, the extraction and storage of information to create statistical databases is becoming an easier and more feasible task. That is the reason why medical researchers, biologists, chemists, and other professionals not related to Mathematics or Statistics may need to learn a range of statistical techniques to process their data. However, it would be erroneous to think that mathematicians or statisticians need no additional training to carry out the work that biostatisticians do. Mathematical knowledge (usually mainly theoretical) is not enough. Ideally, some training in the Bio sciences area would be required too.

But how does one become a biostatistician? What kind of studies do you need? Nowadays there are many training courses and Master’s degrees aimed at biomedical researchers. Here, we will talk about some of them.

• Basic and specific courses:

On the website of Biostatnet and the Spanish Region of the International Biometric Society (Sociedad Española de Biometría, SEB), among many others, we can find a number of basic and specific courses targeted to health researchers. These courses may be orientated towards professionals who are not statisticians (like this) or may have a more complex content (like this). The Servei d’Estadística Aplicada of the UAB (Universitat Autònoma de Barcelona), is an example of interdepartmental service with a lot of courses and seminars of different levels.
There are also public healthcare institutions, among them EVES (Valencian School for Health Studies), which sometimes give courses specially aimed at doctors and nurses. It should be underlined that many scientists work daily with simple tests, such as Student's t-test, and they need to understand them.

Furthermore, if you need better training there are also different Master’s and postgraduate degrees that offer you high specialization.

Currently, in Spain there are few universities that offer Master’s degrees purely in Biostatistics. Most of them are combined with other branches of Statistics such as Bioinformatics (one of the newest tools in Genomics). Others teach Statistics and Operations Research together, with Biostatistics itself being part of the syllabus. At present, the Universitat de València is the only one that offers a Master’s degree in Biostatistics.
The following are some of the Master’s degrees and postgraduate courses in Statistics taught in Spanish universities:

1. Master’s degree

Máster en Técnicas Estadísticas (Interuniversity master degree between UDC, USC and UVIGO)

As we can see, in Spanish universities Biostatistics has yet to be noticed. That is why it would be interesting for them to focus on a science that is winning more and more followers. Do you want to join? Biostatistics is breaking down walls!

Featured

# Graphics: an important issue to communicate

When we think about Biostatistics we usually have in mind some more or less complex modelling examples such as linear models, generalized linear models, etc. However, part of our job is to report our results to non-biostatistical collaborators and we need to be able to explain and talk about them. To do this, a great tool, sometimes “forgotten”, is graphics.

In the last decades the  R – Project for Statistical Computing (known as R) software has grown more than any other, thanks to contributions (packages) that researchers around the world share with the rest of the scientific community. One of the highlights of R is its versatility and customization for performing graphics.

If you are reading these lines, the probability that you know how to plot with R using base graphics like plot(), barplot(), hist(), etc., is very high. My intention in this post is to present two packages that can radically change the look of our graphics, making them more professional and nice-looking. The packages are called lattice and ggplot2; both focus mainly on multivariate data but are flexible and also support univariate data.

Here is a self-explanatory description of lattice by its author: “Lattice is a powerful and elegant high-level data visualization system, with an emphasis on multivariate data, that is sufficient for typical graphics needs, and is also flexible enough to handle most nonstandard requirements”.

Lattice uses a simple code very similar to the syntax in base graphics and supports 3D plots. There is a very interesting book on it called Lattice: Multivariate Data Visualization with R written by Deepayan Sarkar.

With regard to ggplot2, the author describes it as: “a plotting system for R, based on the grammar of graphics, which tries to take the good parts of base and lattice graphics and none of the bad parts. It takes care of many of the fiddly details that make plotting a hassle (like drawing legends) as well as providing a powerful model of graphics that makes it easy to produce complex multi-layered graphics”.

With regard to ggplot2, it uses a special syntax to construct graphics, showing an interesting way of thinking about plots based on the book The Grammar of Graphics by Leland Wilkinson. Plots are created layer by layer. By default the plots are very elegant, sober and professional, but the package also allows for high customization (once the syntax is known). It does not support 3D plots though. It has a website to document and explain the package, and it is worth mentioning the book ggplot2: Elegant Graphics for Data Analysis by Hadley Wickham.
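
As a minimal taste of the three systems, here is the same scatterplot on R's built-in iris data (purely illustrative):

library(lattice)
library(ggplot2)

# base graphics
plot(Sepal.Length ~ Sepal.Width, data = iris, col = iris$Species)

# lattice: one panel per species, specified with a single formula
xyplot(Sepal.Length ~ Sepal.Width | Species, data = iris)

# ggplot2: the same idea built layer by layer
ggplot(iris, aes(Sepal.Width, Sepal.Length, colour = Species)) +
  geom_point() +
  facet_wrap(~ Species)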

To know more about lattice and ggplot2, I would recommend reading the exhaustive comparative in several posts on Learning R, starting with this one. Now it is your turn to check all this, you will love it!!!

Featured

# Some issues in clinical trials design

A clinical trial is an experiment performed on human beings to measure the efficacy of a new treatment under study. The treatment could be a new drug, a new therapy, a surgical procedure, or any new clinical procedure that needs to be approved. Clinical trials therefore play a very important role in drug development and pharmaceutical research, because any new drug or procedure has to pass a thorough examination, often highly regulated by the national drug regulatory agency of each country. Like any experiment, it has a strong statistical background in the whole design, the recruiting and follow-up of patients, and the analysis of the results.

Conventionally, drug trials are classified into four phases, with each phase having a different purpose:

•  Phase 1: Determine the potential toxicity.
•  Phase 2: Preliminary study of efficacy and toxicity.
•  Phase 3: Final test comparing the drug with a commonly used treatment or a placebo.
•  Phase 4: Post approval follow-up of patient status.

Usually, each phase of the drug approval process is considered as a separate clinical trial, and each requires a different statistical analysis.

Phases 1 and 2 cover small to moderate-size experiments (20-50 patients) and are centred on the determination of toxicity and efficacy, so the final aim is to get an estimate of the toxicity-efficacy curve. Usually, different doses are tested on the patients and the measurement of the responses gives an estimate of the optimal dose to ensure maximum efficacy without producing toxicity. There are many designs for these phases, based on optimality criteria, the use of Markov chains…
Phase 3 is the longest one in the trial; it can have thousands of patients involved, and is also the most complex. As we stated before, the new treatment is compared against commonly used treatments or a placebo, so we have to assign the different treatments to the patients that start the trial. There is a wide catalogue of phase 3 designs in the literature; an exhaustive review is given in Rosenberger and Lachin (2002). If the drug successfully passes through Phases 1, 2, and 3, it is approved by the regulatory agency. Finally, Phase 4 involves delineating additional information, including monitoring the treatment’s risks, late-developing side-effects, benefits, and optimal use.

In the process of designing a clinical trial we have to deal with different issues. For example, in phase 3, the principal objective is to provide an unbiased comparison of the difference between treatments. We have to avoid the different biases that can appear in the study; these biases can come from patients, physicians or some unknown covariates, among other factors. A powerful tool to avoid this problem is the random assignment of patients. This kind of trial is called a randomized clinical trial, and different probability rules are used in the assignment of treatments to patients. However, randomization alone does not avoid all biases; for example, wherever possible, clinical trials should be double-masked, i.e., neither the patient nor the physician should know which treatment has been allocated to the patient.
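
One simple assignment rule of this kind is permuted-block randomization, which keeps the arm sizes balanced throughout the trial; a minimal sketch in R (the block size and arm labels are illustrative):

permuted_block <- function(n_patients, block_size = 4, arms = c("treatment", "control")) {
  n_blocks <- ceiling(n_patients / block_size)
  # within each block, shuffle an equal number of assignments to each arm
  assignments <- unlist(lapply(seq_len(n_blocks), function(i) {
    sample(rep(arms, each = block_size / length(arms)))
  }))
  assignments[seq_len(n_patients)]
}

set.seed(7)
permuted_block(10)
table(permuted_block(100))   # arm sizes stay balanced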

Finally, although the importance of using statistical tools to carry out any experiment is well known, in this case, due to their complicated structure and strict regulation, these tools become essential in order to conduct rigorous and efficient clinical trials.

Featured

# Approaching Statistical Genomics

I am sure you heard about the ENCODE project. It has been all around the news last month. Along with other milestones like the Human Genome Project, HapMap or 1000 Genomes, it is a good example of the level of understanding of the human genome we are achieving.

Next Generation Sequencing (NGS) allows DNA sequencing at an unprecedented speed. Genomic projects mainly involve exome (protein-coding regions of the genome) sequencing right now, but the technology is rapidly evolving, and soon enough it will be cost-efficient to sequence whole genomes. Undoubtedly these projects will account for a good part of genomics research funding.

So far a quick and brief overview of what is happening in genomics right now and what is about to come in the near future. But, what does all this mean from a statistical point of view? To say it plain and simple: a huge amount of data will need to be properly analyzed and interpreted.

Between 20,000 and 50,000 variants are expected per exome. Examining an individual´s exome in the search for disease-causing mutations requires advanced expertise in human molecular genetics. We could wonder what happens when we talk about comparing multiple sequence variants among members of families (e.g. linkage analysis for monogenic disorders) or populations (e.g. case-control studies for complex disorders). High-dimensional data are nowadays the rule, and sooner or later anyone working in genomics will face problems that require knowledge of bioinformatics and of specific statistical methods to be solved.

Since one of my fields of interest is the identification of susceptibility genes for complex disorders, I thrive on the new challenges that NGS presents, in particular the possibility to perform rare variants analysis. Ron Do et al. have just published a complete review on this subject.

I am just focusing here on what is usually referred to as tertiary analysis in a NGS pipeline, i.e. analyzing and extracting biological meaning of the variants previously identified. However, we should not forget the opportunities in the development of base calling, sequence alignment or assembly algorithms.

Furthermore,  DNA/exome-sequencing is just one piece of the cake. Some other statistical issues arise in the analysis of other high-throughput “omics” data such as those coming from RNA-seq, ChIP-seq or Methylation-seq studies.

The message of this post: to date, the capacity for generating genomic data is far beyond the ability to interpret that data. Whether you are interested in developing new statistical methods or considering a more applied career, there is no doubt that statistical genomics is a hot field right now!

As an extra incentive for those coming from a mathematical background, you will get to work closely with geneticists, molecular biologists, clinicians and bioinformaticians among others. Interdisciplinarity being one of our blog mottos, statistical genomics wins by far…

Featured

# Biostatistics…why?

Biostatistics as a science is a subdiscipline of Statistics which studies the patterns behind biological processes (e.g., the spread of a disease). Scientists use different methods – from standard statistical methods to complex models – to analyze huge data sets so that researchers can obtain an answer to these biological enigmas. But…Biostatistics….why? This is the question one should address when starting to work in this field. Biostatisticians are often asked to justify why they choose this area to start or even improve their professional career.

Data analysis has always been performed. Before the 19th century, most scientists with a basic knowledge of Statistics were able to carry out simple calculations to validate their daily scientific experiments. The starting point of modern Biostatistics applications was set in the past two centuries, with Charles Darwin and Francis Galton, among others. Besides, the latter was a cofounder of the well-known statistical journal Biometrika. In the last decades, the complexity of scientific research studies (design, studied sample…) and the development of technology have grown enormously. This has led to the development of complicated statistical methods – sometimes ad hoc – and, consequently, to the requirement of specific skills for performing them: apart from Statistics, knowledge of medical topics and computer programming is highly recommended.

There are several papers which remark the importance of a biostatistician in biomedical sciences (e.g., Bross (1974); Donald W. Marquardt (1987), Greenhouse S. (2003); María Jesús Bayarri et al. (2012)). It is clearly revealed that the role of a data analyst – we are often called this way, and I have to admit I somehow dislike this term – is not as simple as the one of a shoe store clerk: I mean, we cannot sit and wait for requests coming from clinicians or other researchers who need to develop multiple regression analyses (most times) to obtain results. A statistician must be ambitious, have adventures with data, “play” with them and search for better statistical strategies than the current ones. There is always place for improvement. We are seen as data-machines/compilers looking for statistical significance (p<0.05) and we should show to other professionals that our daily work: (a) is not based on “significance”; (b) may influence resulting policy choices made by governments or other important organizations. In other words, the general public should perceive that our role is much more than pressing a button and getting the result in 5 minutes. Our function is to challenge and influence the community in order to hopefully make the society a better one. Fortunately, important biomedical journals such as Journal of American Medical Association (JAMA) have begun to give more relevancy to the complexity of the statistical procedure. It is the first step.

Biostatistics in Spain

Compared to other countries, Biostatistics could be considered an emerging discipline in Spain. Although there is still much work to do, it is remarkable that an increasing demand for biostatisticians has been perceived in the scientific community. Due to new rising areas such as Genomics, spatial Statistics or Functional Data Analysis, several multidisciplinary research groups have been set up with at least one biostatistician being part of them. The National Biostatistics Network BIOSTATNET is proof of this. This network, created in 2010, is composed of 8 nodes from different regions of Spain aiming to coordinate and promote research in Biostatistics.

I think I have given many reasons for choosing Biostatistics as a profession. It is a field where you can be connected to people coming from different areas, which allows you to learn about many more topics than expected. In a few words, Biostatistics grips you!

“Statistical thinking will one day be as necessary for efficient citizenship

as the ability to read and write”

Herbert George Wells

Featured

# Two is a crowd

When it comes to networking in Biostatistics, the well-known rule of the 6 degrees of separation seems to get narrower.

Intrigued by Michael Salter-Townshend´s article in the last month´s Significance Big Data Special Issue, I tried the InMaps Linkedin application for both my profile and Biostatnet´s (with the permission of  its main researchers).

At first glance, it can be noticed that there are obvious differences between the two of them, most probably due to the fact that mine includes friends and family that are not necessarily linked to the field of Biostatistics, and therefore does not show such a clear conglomerate of mutually linked connections (or small world network), being divided instead into two main clusters (forming a sort of scale-free network): one that could be identified with my social life and previous studies (dark turquoise), and the other one (rest of colours) intimately related to my current employment. It is also worth noticing that the coloured clusters in Biostatnet´s map are not necessarily associated with the nodes that constitute the network, but with the different areas of study (clinical, applied,…) instead. This clearly reflects the multidisciplinary nature of an area of study that requires other fields such as Biology, Computing, Mathematics and Medicine for its successful development.

However, the importance of these maps does not just lie in the identification of clusters but in the potential for inferring further information from them. As a matter of fact, it has been shown that the often criticized social networks can not only help us when bored or looking for a job, but also encourage and facilitate interdisciplinarity, and provide researchers with essential information for the study of scientific phenomena such as the spread of epidemics, since this is very often determined or affected by social interaction (see papers by Liu and Xiao and Corner et al.). This also applies to the study of the distribution of species in ecological niches, whose analysis is certainly similar to that of social networks (see papers by Johnson et al. and Coleing). It has been shown that species involved in a trophic chain with more and better connections are more likely to survive should any changes happen in their environment.
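
For readers wanting to play with this kind of connectivity measure themselves, here is a minimal sketch with the igraph package on a made-up toy network:

library(igraph)

# a toy collaboration network (hypothetical edges between five researchers)
g <- graph_from_literal(Ana - Bea, Ana - Carlos, Bea - Carlos, Carlos - Diego, Diego - Eva)

degree(g)          # number of connections per node
transitivity(g)    # clustering: how often collaborators of collaborators are connected
mean_distance(g)   # average shortest path between pairs of nodes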

In conclusion, it seems that when networking, two highly-connected contacts are already a crowd and provide much more information than we could ever imagine, so…let´s network!!

Have you tried with yours? Any surprises there? Have you used network analysis in your research? Tell us about it!!