Featured

# Interview with… Manuel G. Bedia

Manuel G. Bedia is an assistant professor in the Department of Computer Science at the University of Zaragoza. He is one of the founders of the Spanish Network of Cognitive Science (retecog.net). This network has been established to promote and coordinate research in cognitive systems, with goals overlapping those of the European network EUCognition but with more emphasis on the relationships between scientific and educational policies and the Spanish university system. He holds a BSc in Physics, an MSc in Technological Innovation Management and a PhD in Computer Science and Artificial Intelligence (Best PhD Thesis Award, 2004), all from the University of Salamanca (Spain). He has worked as a technological consultant in innovation and knowledge management (Foundation COTEC, Madrid, Spain) and as a research fellow in the field of artificial cognitive systems in the Department of Computer Science at the University of Salamanca, the Planning and Learning Group at the University Carlos III of Madrid (Visiting Professor, 2005-07) and the Multidisciplinary Institute at the University Complutense of Madrid (2004-07). He has also been a visiting postdoctoral researcher at the Institute of Perception, Action and Behavior (University of Edinburgh, 2005) and the Centre for Computational Neuroscience and Robotics (University of Sussex, 2008).

1.     Your area of research is Cognitive Sciences. Could you give us a brief introduction about the focus of your work?

Cognitive science is a space for interdisciplinary research where we aim to understand how the mind works. It brings together neuroscientists, psychologists, philosophers, engineers and, of course, statisticians too!

During the past five decades, analogies between the human mind/brain and computer software/hardware have guided the work of researchers trying to understand how we think, reason and solve problems.

However, over the last few years, new conceptions have arisen that cast doubt on this view. The biggest influence behind this change in perspective has come from engineers rather than scientists: in particular, a group of engineers using the disciplinary tools of engineering to generate new scientific hypotheses instead of applying knowledge generated in other areas.

In a reversal of the usual role of engineers, who traditionally use models to develop artifacts, this process develops tools with which to think about the phenomena of the mind.

2. Could you give us an example of this?

Imagine we purposefully build a very simple artifact or software program that is capable of performing a certain task in a novel way. This proves that there are explanatory alternatives for phenomena that were supposed to work in a particular way. In the words of other authors, such models serve as "mental gymnastics". They are the equivalent of classical thought experiments: artifacts that help our thinking. These tools are the foundations of modelling exercises: dynamical systems, probability theory, etc.

3. Is probability an important tool in your work?

It is indeed very important and relevant at many levels of the research in this area.

At a fundamental level, the mathematical languages that the early Artificial Intelligence (AI) researchers developed were not sufficiently flexible (they were based on the use of logic and rule systems) to capture an important characteristic of our intelligence: its ability to reorganise itself interactively. This led to a growing interest in tools that could embrace uncertainty.

Recently, a very interesting approach whose fundamental principles are based on probability has been developed in the area: Artificial General Intelligence (AGI). The original goal of the AI field was the construction of "thinking machines", that is, computer systems with human-like general intelligence. Due to the difficulty of this task, for the last few decades the majority of AI researchers have focused on what has been called "narrow AI": the production of AI systems displaying intelligence in specific, highly constrained tasks. In recent years, however, more and more researchers have returned to the original goals of the field, recognising the necessity and emerging feasibility of treating intelligence holistically. AGI research differs from ordinary AI research by stressing the versatility and wholeness of intelligence. Essentially, its main objective is to develop a theory of Artificial Intelligence based on Algorithmic Probability (further explanations can be found here).

At a more concrete level, there are several examples. For instance, it is well known that the reasoning model of the clinical environment is fundamentally Bayesian: clinicians analyse and reflect on patients' previous conditions and status before reaching a diagnosis of their current condition. This fits very well with the whole idea of Bayesian probability. Following the same line of reasoning, probability appears as a fundamental tool for modelling artificial minds that think as humans do.
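As a quick illustration of this diagnostic updating, Bayes' rule can be applied with some invented numbers (the prevalence, sensitivity and specificity below are hypothetical, chosen only for the sketch):

```r
# Invented numbers: prior prevalence, and a test's sensitivity/specificity
prior <- 0.01   # P(disease)
sens  <- 0.95   # P(positive | disease)
spec  <- 0.90   # P(negative | no disease)

# Bayes' rule: P(disease | positive) = P(positive | disease) P(disease) / P(positive)
evidence  <- sens * prior + (1 - spec) * (1 - prior)  # P(positive)
posterior <- sens * prior / evidence
round(posterior, 3)   # 0.088: a positive test still leaves much uncertainty
```

Even with a very accurate test, a rare condition remains improbable after a single positive result, which is exactly the kind of updating a clinician performs informally.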

In general, this Bayesian framework is the most used in our field.

4. How can this be applied in your area of research?

The Bayesian framework for probabilistic inference provides a general approach to understanding how problems of induction can be solved in principle, and perhaps how they might be solved in the human mind. Bayesian models have addressed animal learning, human inductive learning and generalisation, visual perception, motor control, semantic memory, language processing and acquisition, social cognition, etc.

However, I believe that the most important use comes from the area of neuroscience.

5. So what is the neuroscientific viewpoint within the cognitive sciences, the field devoted to understanding our mental functions?

Neuroscience seeks to understand the brain through the neural correlates that are activated when an individual performs an action. The advances in this area over the years are impressive, but this conceptual point of view is not without problems. For instance, as Alva Noë states in his famous book Out of Our Heads, the laboratory conditions under which the measurements are taken substantially affect the observed task… a sort of second-order cybernetics effect, as defined by Margaret Mead decades ago. The history of neuroscience also includes some errors in the statistical analysis and inference phases…

6. Could you explain this further?

In the early 90s, David Poeppel, when researching the neurophysiological foundations of speech perception, found that none of the six best studies of the topic matched his methodological apparatus (read more here).

Apparently, these issues were solved when functional magnetic resonance imaging (fMRI) emerged. As this technique was affordable, it allowed more groups to work on the topic and indirectly forced the analytical methods to become more standardised across the different labs.

However, these images brought a new problem. In her article "Duped", Margaret Talbot described how the mere inclusion of fMRI images in papers had arguably increased the probability of their being accepted.

7.  You have also mentioned that big mistakes have been identified in the statistical analysis of data in the area. What is the most common error in your opinion?

In 2011 an eye-opening paper was published on this topic (find it here). The authors focused their research on a common error: comparing the significance of two effects instead of testing the significance of their difference.

Let's assume one effect is statistically significantly different from controls (i.e. p < 0.05), while another is not (p > 0.05). It is tempting to conclude that the two effects differ from each other. On the surface this sounds reasonable, but it is flawed, because it says nothing about how different the two effects are from one another. To establish that, researchers need to separately test for a significant difference (interaction) between the two results in question. Nieuwenhuis and his co-workers summed up the solution concisely: '…researchers need to report the statistical significance of their difference rather than the difference between their significance levels.'
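As a toy illustration of this point (all numbers invented), suppose two estimated effects with equal standard errors, one significant and one not; the proper analysis tests their difference directly:

```r
# Effect A: estimate 0.8, SE 0.3  ->  significant (p < 0.05)
# Effect B: estimate 0.4, SE 0.3  ->  not significant (p > 0.05)
b1 <- 0.8; se1 <- 0.3
b2 <- 0.4; se2 <- 0.3
p1 <- 2 * pnorm(-abs(b1 / se1))   # ~0.008
p2 <- 2 * pnorm(-abs(b2 / se2))   # ~0.18
# Wrong: concluding A and B differ because p1 < 0.05 and p2 > 0.05.
# Right: test the difference between the two effects.
z_diff <- (b1 - b2) / sqrt(se1^2 + se2^2)
p_diff <- 2 * pnorm(-abs(z_diff)) # ~0.35: no significant difference
```

Here one effect clears the 0.05 threshold and the other does not, yet the direct test shows no evidence that the effects differ, which is precisely the error Nieuwenhuis and colleagues describe.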

The authors had the impression that this type of error was widespread in the neuroscience community. To test this idea, they went hunting for ‘difference of significance’ errors in a set of very prestigious neuroscience articles.

The authors analysed 513 papers in cognitive neuroscience in the five journals of highest impact (Science, Nature, Nature Neuroscience, Neuron and The Journal of Neuroscience). Out of the 157 papers that could have made the mistake, 78 used the right approach whereas 79 did not.

After finding this, they suspected that the problem could be more widespread and went on to analyse further papers. Out of these newly sampled 120 articles on cellular and molecular neuroscience published in Nature Neuroscience in 2009 and 2010, not a single publication used correct procedures to compare effect sizes. At least 25 papers erroneously compared significance levels, either implicitly or explicitly.

8. What was the origin of this mistake?

The authors suggest that it could be due to the fact that people are generally tempted to attribute too much meaning to the difference between significant and not significant. For this reason, the use of confidence intervals may help prevent researchers from making this statistical error. Whatever the reasons behind the mistake, its ubiquity and potential impact suggest that researchers and reviewers should be more aware that the difference between a significant and a non-significant result is not itself necessarily significant.

I see this as a great opportunity and a challenge for the statistical community, i.e., to contribute to the generation of invaluable knowledge in the applied areas that make use of their techniques.

Selected publications:

Bedia, M. & Di Paolo, E. (2012). Unreliable gut feelings can lead to correct decisions: The somatic marker hypothesis in non-linear decision chains. FRONTIERS IN PSYCHOLOGY. 3 – 384, pp. 1 – 19. 2012. ISSN 1664-1078

Aguilera, M., Bedia, M., Santos, B. and Barandiaran, X. (2013). The situated HKB model: How sensorimotor spatial coupling can alter oscillatory brain dynamics. FRONTIERS IN COMPUTATIONAL NEUROSCIENCE. 2013. ISSN 1662-5188

De Miguel, G and Bedia, M.G. (2012). The Turing Test by Computing Interaction Coupling. HOW THE WORLD COMPUTES: TURING CENTENARY CONFERENCE AND 8TH CONFERENCE ON COMPUTABILITY IN EUROPE, CIE 2012. Cambridge, ISBN 3642308694

Santos, B., Barandiaran, X., Husband, P., Aguilera, M. and Bedia, M. (2012). Sensorimotor coordination and metastability in a situated HKB model. CONNECTION SCIENCE. 24 – 4, pp. 143 – 161. 2012. ISSN 0954-0091


# Spatial objects in R (I)

In this post I would like to offer a brief and practical introduction to the different types of spatial data that we can handle in R, and to how to access and visualise them. It might be interesting for those who are beginners in spatial analysis.

The structure of spatial objects is complex, and it is very important to know how they are stored and organised. The sp package provides a clear way to organise spatial data through classes (which specify the structure) and methods (functions particular to each data class).

The basic type of spatial object is the Spatial class, in which we can recognise two principal components, called slots: the first one is a bounding box that contains a matrix with the coordinate values; the second is a CRS class object, the reference system for these coordinates ("longlat" for geographical coordinates, "utm" for UTM coordinates, etc.). With the command getClass("Spatial") (once the sp package has been installed), we can examine the different types of spatial objects and their subclasses.
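As a small sketch of these two slots, we can build a SpatialPoints object by hand (the two coordinates below are invented for illustration):

```r
library(sp)

# Two invented points in geographical (longitude/latitude) coordinates
coords <- cbind(long = c(-3.70, 2.17), lat = c(40.42, 41.39))
pts <- SpatialPoints(coords, proj4string = CRS("+proj=longlat +ellps=WGS84"))

bbox(pts)         # the bounding-box slot: min and max of each coordinate
proj4string(pts)  # the CRS slot: "+proj=longlat +ellps=WGS84"
```

Both accessors simply read the corresponding slots of the Spatial class described above.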

I will use an example to make this practical: specifically, a Spanish cartography ("ESP_adm.zip", downloaded from this web site: http://www.gadm.org/). To read the .shp object with the readShapeSpatial function, we need the maptools library.

library(maptools)
# Read the province-level shapefile (the exact file name inside ESP_adm.zip may differ)
map <- readShapeSpatial("ESP_adm2.shp")
# map is a SpatialPolygonsDataFrame, a subclass of a SpatialPolygons object.


A SpatialPolygons object is made of closed lines: sequences of point coordinates where the first point is the same as the last point. It is not easy to grasp their structure, so let us look at it briefly:

# Examine the structure of this SpatialPolygonsDataFrame object.
# Each of its 5 slots is accessed via '@name_slot'
str(map, max.level = 2)


But what does each slot mean?

• @data: a data frame containing information about each polygon (for instance, the names of the provinces and autonomous communities). To access it:

# The provinces:
map@data$NAME_2 # or slot(map, 'data')$NAME_2
# Idem for the autonomous communities with 'NAME_1'

• @polygons: a list of 51 polygons (all the provinces), of class Polygons. Each of these polygons has 5 slots; we can see the structure of the first polygon with:

str(map@polygons[[1]]@Polygons)

• @plotOrder: a vector giving the plotting order of the 51 polygons.

• @bbox: a two-dimensional matrix with the maxima and minima of the coordinates.

• @proj4string: the coordinate reference system (a CRS class object).

Finally, I will give just a simple example of how to plot some geographical coordinates on the map. For this, we read a data file with the locations and give it a spatial structure by creating a SpatialPoints class. We need to be careful with the reference systems of the SpatialPolygons and the SpatialPoints we use. The coordinate reference system of our shapefile is latitude/longitude with the WGS84 datum, so we need the same CRS for our SpatialPoints. But what if we do not have geographic coordinates? Let's see! Assume that the coordinate reference system of the SpatialPoints is UTM (Universal Transverse Mercator) instead of latitude/longitude.

loc_utm <- read.csv("loc_utm.csv", header = TRUE)

> loc_utm
         X         Y
1 718970.0 4376275.0
2 505574.2 4793683.2
3 276066.1 4542666.9
4 538135.4 4750010.9

There are four locations, situated (in order) in Valencia, Bilbao, Salamanca and Santiago de Compostela. The first three points lie in UTM zone 30 and the last one in zone 29. It is important to know this in order to define the CRS correctly:

# We separate the points according to their zone:
loc1 <- loc_utm[1:3, ]
loc2 <- loc_utm[4, ]
# To create the SpatialPoints objects:
coordinates(loc1) <- c("X", "Y")
coordinates(loc2) <- c("X", "Y")
# To define the CRS:
proj4string(loc1) <- CRS("+proj=utm +zone=30 +ellps=WGS84")
proj4string(loc2) <- CRS("+proj=utm +zone=29 +ellps=WGS84")
# The spTransform function provides transformations between projections.
library(rgdal)
loc1_geo <- spTransform(loc1, CRS("+proj=longlat +ellps=WGS84"))
loc2_geo <- spTransform(loc2, CRS("+proj=longlat +ellps=WGS84"))

Now we can plot these points over the Spanish map:

plot(map)
plot(loc1_geo, add = TRUE)
plot(loc2_geo, add = TRUE)

This post has been a short summary of some of the spatial objects that we can manipulate in R. In the following posts I will invite you to continue learning about this area of statistics. I hope it helps someone!

# Working with Joint Models in R

Following previous posts on longitudinal models (see posts) and interviews in which survival analysis has been mentioned (read one of these interviews), today we will focus on models that incorporate both methodologies, the so-called joint models.

Often a problem admits more than a single approach. For example, when HIV data are analysed, we can on the one hand study the evolution of CD4 cell counts (a longitudinal model) and on the other hand study the time to death (a survival analysis model). Joint modelling allows both models to share information and achieves better results.

From a practical point of view, in R we can use two packages, JM and JMbayes, to fit these models under the frequentist and the Bayesian paradigm, respectively. These packages are not difficult to use because they build on the nlme and survival packages, with which you might already be familiar.
Here I show you how to use them:

• We load the packages JM and JMbayes

library(JM)
library(JMbayes)

• We fit the longitudinal and survival models independently, saving the results in objects

# Fit the longitudinal model
lmeFit.aids <- lme(CD4 ~ obstime * drug, random = ~obstime | patient, data = aids)
# Fit the survival model
survFit.aids <- coxph(Surv(Time, death) ~ drug, data = aids.id, x = TRUE)

• We pass these lmeObject and survObject objects to the functions in the JM and JMbayes packages

# Fit the joint model (frequentist)
jointFit.aidsFREQ <- jointModel(lmeFit.aids, survFit.aids, timeVar = "obstime")
# Fit the joint model (Bayesian)
jointFit.aidsBAYES <- jointModelBayes(lmeFit.aids, survFit.aids, timeVar = "obstime")

It is possible to specify different characteristics of the joint model. With the JM package, various options for the survival model are available. With JMbayes it is possible to choose the type of association structure between the longitudinal and survival processes.

Finally, if you are interested in these models and are thinking about using them in your research, a good reference is the book Joint Models for Longitudinal and Time-to-Event Data: With Applications in R by Dimitris Rizopoulos, the author of JM and JMbayes.

Note: the data set used to fit the longitudinal model must have as many rows as there are observations per patient (aids). The survival model, on the other hand, requires a data set with only one row per patient, holding the time to death and the censoring indicator (aids.id).

# Have an animated 2014!

Animations can be a refreshing way to make a website more attractive, as we showed exactly a year ago here. Based on the Brownian motion simulated here, and using the animation and ggplot2 R packages, we produced a fun welcome to the International Year of Statistics, Statistics2013. The function saveMovie (note that there is an alternative, saveGIF) allowed us to save it and finally publish it, as easy as that!
For this new year we have based our work on demo('fireworks') from version 2.0-0 of the animation package (find here a list of the changes in each version) and on world from the maps package (speaking of maps, have a look at this great dynamic map by Bob Rudis!).

These packages also come in very handy for representing a dynamic process visually, the evolution of a time series, etc. As an example, in a previous post we portrayed an optimisation model in which the values of the mean asymptotically approached the optimal solution, which was reached after a few iterations. This was done with a for loop and, again, the ggplot2 and animation packages.

There are many other examples of potential applications, both in general statistics (see this animated representation of the t distribution and these synchronised plots of Markov chains and posterior distributions) and in biostatistics. Some examples of the latter are this genetic drift simulation by Bogumił Kamiński and the animated plots facility in Bio7, the integrated environment for ecological modelling.

The R package caTools also allows you to read and write images in GIF format. Finally, in LaTeX, \animategraphics from the animate package will do the trick; check this post by Rob J. Hyndman for further details.

It is certainly one of our new year's resolutions to incorporate more animations in our posts. What are yours?

# Interview with… Anabel Forte

Anabel Forte Deltell, a graduate in mathematics and statistics, holds a PhD in Statistics from the University of Valencia. She is now a lecturer in the economics department of Universitat Jaume I in Castellón. Her research covers many fields of statistics, for instance spatial analysis and the joint modelling of longitudinal (or panel) data and time-to-event data, with her main interest being Bayesian model selection and, in particular, objective Bayesian variable selection.
Her dissertation (2011), entitled "Objective Bayes Criteria for Variable Selection", focuses on this last topic. She has published various articles in international journals and participated in many national and international meetings. Email: forte@uji.es

1. Why do you like Biostatistics?

Researching in biostatistics is researching in real life, and it makes me feel like I am helping to improve people's living conditions.

2. Could you give us some insight into your current field of research?

Although my main line of research is Bayesian model selection, I am currently working on the joint modelling of longitudinal data and time-to-event data. Nowadays many practitioners are moving to what is called personalised medicine, with medical decisions, practices and/or products being tailored to the individual patient. In this scenario longitudinal data seem to be of great importance, since they allow for the consideration of several sources of uncertainty, including patient-specific variability. Moreover, quantifying a patient's risk of suffering a certain event given his or her trajectory seems really sensible, and this can be done using joint modelling.

3. What are, in your opinion, the main advantages of being a researcher?

For me, being a researcher is simply a way of life. I cannot imagine myself doing anything else.

4. Your whole professional experience has been within the public sector and the university. How do you see the present and future of research in the Spanish public sector?

From my point of view, if things do not change greatly, research in Spain, at least in the public sector, has a really black future. As I see it, the only way researchers can manage to get money is to attract the attention of private investors. But private money can compromise the objectivity of research and, at the same time, discourage research in fields that are not so "attractive" to industry. Hence it is something that has to be studied carefully… but something has to be done, that's for sure.

5. What do you think of the situation of young biostatisticians in Spain?

As I see it, young biostatisticians have to make a great effort to show their enormous value to private companies, for instance food processing companies or health care related companies, among others… And I think so because, sadly, nowadays the Spanish public system cannot absorb all the biostatistical research that is needed.

6. What would be the 3 main characteristics or skills you would use to describe a good biostatistician?

For me it is really important to have a good mathematical/probabilistic base and to be well organised when working. But the most important of all is enjoying research.

7. Which do you think are the main qualities of a good mentor?

For me, a good mentor is someone who supervises your work and is always there for you, but who at the same time gives you some space to fail and learn… in other words, someone who teaches you what researching really is.

Selected publications:

• Francisco Pozo-Rodríguez; Jose Luis López-Campos; Carlos J. Álvarez-Martínez; Ady Castro-Acosta; Ramón Agüero; Javier Hueto; Jesús Hernández-Hernández; Manuel Barrón; Victor Abraira; Anabel Forte; Juan Miguel Sanchez Nieto; Encarnación Lopez-Gabaldón; Borja G. Cosío; Alvar Agusti. Clinical Audit of COPD patients requiring hospital admissions in Spain: AUDIPOC Study. Plos One. 7 – 7, (USA): 2012.

• M. J. Bayarri; J. O. Berger; A. Forte; G. García-Donato. Criteria for Bayesian model choice with application to variable selection. Annals of Statistics. 40 – 3, pp. 1550 – 1577. (USA): 2012.

• C. Armero; A. Forte; A. López-Quílez. Geographical variation in pharmacological prescription. Mathematical and Computer Modelling. 50, pp. 921 – 928. (UK): 2009.

• Allepuz; A. López-Quílez; A. Forte; G. Fernández; J. Casal. Spatial Analysis of Bovine Spongiform Encephalopathy in Galicia, Spain (2000-05). Preventive Veterinary Medicine. 79 – 2, pp. 174 – 185. (Holland): 2007.

• López-Quílez; C. Armero; A. Forte. Geographical variation of Pharmacological prescription with Bayesian Hierarchical models. Value In Health. 10 – 6, (USA): 2007.

# Hasta la vista, Statistics2013!

After twelve months filled with exciting events celebrating the great role of statistics in society, it is now time to say goodbye to the International Year of Statistics, Statistics2013.

Illustration: Cinta Arribas

From FreshBiostats we would like to thank the steering committee for making this such a fantastic opportunity for professionals and students at all levels to interact and share the pride in what we do. We would like to send a special thank you to Ronald Wasserstein and Jeffrey Myers, who helped us feel involved in the celebration with our small contributions to their newsletter and blog.

The good news is that it seems there will be a new project starting in late January. We are really looking forward to The World of Statistics! (Find here a whole lot of new activities already arranged for 2014.)

# How to make an R package repository: an R-recipe

It is well known that R is one of the statistical packages most used by data scientists (researchers, technicians, etc.). It gives them the flexibility to adapt already developed functions to their own interests. However, a problem arises when one needs to implement a complex algorithm that requires serious programming, or to show scientists worldwide how a new statistical technique works. In such a scenario, it is time to create your own R package.

In this post we will learn how to make a simple R package by following the steps explained in the next paragraphs:

1. Update your R version and other relevant tools.

First of all, if you are planning to develop an R package, it is highly recommended to update your R version to the newest one.
One must take into account that R versions are updated continually, so a package built under an old version may well not work properly. Furthermore, you need to install Rtools on your system. It provides several add-ins (such as the MinGW compiler) which are needed for building packages.

2. Get to know your operating system.

On Windows, you have to add to the PATH system variable the full path where R is installed. For instance, if R is installed in c:, the PATH can be set as:

PATH=%PATH%;c:\Rtools\bin;c:\Rtools\perl\bin;c:\Rtools\MinGW\bin;c:\texmf\miktex\bin;c:\R-2.15.0\bin;

3. Decide the name and the kind of package you want to build.

This step is the most important. First of all, the package name can only contain numbers and letters, and it is highly recommended that it starts with a letter. When writing the functions belonging to the package, the arrangement of the files in which they live is very important. There are two extreme options: (1) all the functions in one file; (2) each function in its own file. Option (1) can make the file unwieldy, while option (2) leaves you with many files to keep track of. As an alternative to both, I suggest grouping related functions into the same file.

4. Edit some files.

Once the previous steps are done, several files should be filled in or edited so that any R user who wants to use the functions in the package can understand what each function does. First of all, it is obligatory to fill in a file named DESCRIPTION. It describes the package, with the author and the license conditions, in a structured text format that is readable by computers and by people. Secondly, a tutorial for the package (a vignette) should be written: an extended description with reproducible examples showing the performance of each function in the repository.

5. Building, installing and executing the package.

As a final step, you have to bring all the components together. Using the package.skeleton() function (see below) one can create the desired mypkg package:

package.skeleton(list = c("f", "g", "h", "i"), name = "mypkg")

where f, g, h and i are the functions belonging to the repository. Once mypkg is created, the next step is to install it in our R system. You can use the usual ways to do this, but I would advise using the command line (R CMD INSTALL). Load the package and execute it. It is time to enjoy it!!

This has been a summary of the steps to follow to develop an R package repository. Now it is your challenge!!! Merry Christmas and Happy New Year!

# Analysis of PubMed search results using R

Looking for information about meta-analysis in R (the subject of an upcoming post, as it has become popular practice to analyse data from different Genome Wide Association studies), I came across this tutorial from The R User Conference 2013. I couldn't make it this time, even though it was held so close; maybe Los Angeles next year…

Back to the topic at hand: that is how I found out about the RISmed package, which is meant to retrieve information from PubMed. It looked really interesting because, as you may imagine, this is one of the most used resources in my daily routine.

Its use is quite straightforward. First, you define the query and download data from the database (be careful about your IP being blocked from accessing NCBI in the case of large jobs!). Then, you might use the information to look for trends on a topic of interest, extract specific information from abstracts, get descriptives,…

In order to try it out, I decided to get data on what has been published about Next Generation Sequencing.
To do so, I adopted the search terms proposed in the paper by Jia et al. Through the following code we can get the PubMed results for these search terms since 1980:

library(RISmed)
query <- "(exome OR whole OR deep OR high-throughput OR (next AND generation) OR (massively AND parallel)) AND sequencing"
ngs_search <- EUtilsSummary(query, type = "esearch", db = "pubmed", mindate = 1980, maxdate = 2013, retmax = 30000)
QueryCount(ngs_search)
ngs_records <- EUtilsGet(ngs_search)
years <- Year(ngs_records)
ngs_pubs_count <- as.data.frame(table(years))

This code allows us to get the published papers on this topic per year. By also getting data about the total number of publications per year, we are able to normalise the counts. The complete R code, once the data are downloaded and edited, can be found at the FreshBiostats GitHub Gist. In the next graph, we can see the publication trend for Next Generation Sequencing per year:

I was also curious about which journals have the highest number of publications on this topic. Using the following code we can get the count of NGS publications per journal:

journal <- MedlineTA(ngs_records)
ngs_journal_count <- as.data.frame(table(journal))
ngs_journal_count_top25 <- ngs_journal_count[order(-ngs_journal_count[, 2]), ][1:25, ]

Again, the complete code that allows us to normalise the data by the total number of publications per journal, as well as the barplots showing the result, is available at our Gist.

You can find some other examples using this package at Dave Tang's Bioinformatics blog. Additionally, some alternatives to the RISmed package can be found at the R Chronicle and R Psychologist blogs. Other potential applications of this package include creating a co-author network, as described in Matthew Maenner's blog. Search and analyse carefully!
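The normalisation step mentioned above can be sketched as follows (the yearly totals here are invented; in practice they would come from a separate PubMed query):

```r
# Invented counts: NGS papers per year and total PubMed records per year
ngs_pubs_count <- data.frame(years = 2008:2010, Freq  = c(120, 340, 760))
total_pubs     <- data.frame(years = 2008:2010, total = c(780000, 820000, 860000))

# NGS publications as a proportion of all publications in each year
merged <- merge(ngs_pubs_count, total_pubs, by = "years")
merged$rate <- merged$Freq / merged$total
merged
```

Plotting `rate` instead of the raw `Freq` guards against mistaking the general growth of the literature for growth of the topic itself.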
Featured # My impressions from the Workshop on the Future of the Statistical Sciences An exciting workshop, organised as the capstone of the International Year of Statistics Statistics2013, was celebrated on the 11th and 12th of November at the Royal Statistical Society Headquarters in London, under the stimulating title of “The Future of the Statistical Science”, science to which John Pullinger, president of the Royal Statistical Society, refers to as a quintessentially, multidisciplinary discipline. The workshop was indeed a good example of that. Online attendance was welcomed to the event giving an additional opportunity to listen to highly reputed professionals in main areas of Statistics. Virtual participants were also allowed to pose their questions to the speakers, an innovation that worked very well as an excellent way to make these sorts of events available to a wider audience that would otherwise be excluded – I am already looking forward to more! -. In case you missed it at the time, luckily the podcast is still available here. True to its description, the event covered different areas of application, it showed the tools of the field for tackling a wide range of challenges, it portrayed potentially high impact examples of research, and in summary, was a great attention-grabbing exercise that will hopefully encourage other professionals to the area. A common vision of ubiquity and great relevance came across from the speeches and showed the field as a very attractive world to join. See my simplified summary below in form of a diagram certainly reflective of the great Statistics Views motto “Bringing Statistics Together”. Fig 1. Some of the ideas discussed in the workshop Particularly relevant to this blog and my particular interests were presentations on environmental Statistics and statistical Bioinformatics, as well as other health-related talks. Important lessons were taken on board from the rest of the speakers too. 
The workshop started with a shared presentation on the hot topic “Statistical Bioinformatics” by Peter Bühlmann (ETH Zürich) and Martin Vingron (Free University of Berlin). In a fascinating talk, Bühlmann argued for the need for uncertainty quantification in an increasingly heterogeneous data world – examples of this are also appearing in other fields, e.g. in the study of autism disorders, as Connie Kasari and Susan Murphy mentioned in “SMART Approaches to Combating Autism” – and for models to be the basis of the assignment of uncertainties – a topic well covered by Andrew Gelman in “Living with uncertainty, yet still learning” – while acknowledging that in the big data context, “confirmatory statistical inference is (even more) challenging”. Vingron followed by focusing on “Epigenomics as an example”, raising open questions to the audience on how to define causality in Biology, where “most processes […] are feedback circular processes”, calling for models that are just complex enough to allow for mechanistic explanations, and for good definitions of null hypotheses. In addition to the interesting points of the talk, I found its title particularly attractive in that it could be directly linked to a vibrant roundtable on “Genomics, Biostatistics and Bioinformatics” in the framework of the 2nd Biostatnet General Meeting, in which, as some of you might remember, the definitions of the terms Biostatistics and Bioinformatics were discussed. I wonder if the term “statistical Bioinformatics” would indeed be the solution to that dilemma? As a matter of fact, Bühlmann himself mentions at the start of his talk other options like Statistical Information Sciences, etc. Michael Newton and David Schwartz from the University of Wisconsin-Madison also focused on the triumph of sequencing and “Millimeter-long DNA Molecules: Genomic Applications and Statistical Issues”.
With breast cancer being one of the applications mentioned, this was followed by an introduction to “Statistics in Cancer Genomics” by Jason Carroll and Rory Stark (Cancer Research UK Cambridge Institute, University of Cambridge), particularly focusing on breast and prostate cancer and the process of targeting protein binding sites. The latter, as a computational biologist, mentioned challenges in the translation of Statistics for biologists and vice versa, and identified “reproducibility as (the) most critical statistical challenge” in the area – also on the topic of reproducibility, it is especially worth watching “Statistics, Big Data and the Reproducibility of Published Research” by Emmanuel Candes (Stanford University). Another area increasingly gaining attention in parallel with technological improvements is Neuroimaging. Thomas Nichols (University of Warwick) led the audience on a trip through classical examples of its application (from Phineas Gage´s accident to observational studies of the structural changes in the hippocampi of taxi drivers, and the brain´s reaction to politics) up to current exciting times, with great “Opportunities in Functional Data Analysis”, which Jane-Ling Wang (University of California) promoted in her talk. Also in the area of Health Statistics, different approaches to dietary pattern modelling were shown in “The Analysis of Dietary Patterns and its Application to Dietary Surveillance and Epidemiology” by Raymond J. Carroll (Texas A&M University) and Sue Krebs-Smith (NCI), with challenges in the area being: finding dietary patterns over the life course – e.g. special waves of data would appear in periods like pregnancy and menarche for women – and the incorporation of technology into the studies as a new form of data collection – e.g., pictures of food connected to databases. Again with a focus on the challenges posed by technology, three talks on environmental Statistics outlined an evolution of the field over time.
Starting from current times, Marian Scott (University of Glasgow), in “Environment – Present and Future”, stated in her talk that “natural variability is more and more apparent” because we are now able to visualise it, an idea that goes back to the heterogeneity highlighted by Bühlmann. Despite the great amounts of data available – initially thanks to technological improvements but ultimately due to public demand (“people are wanting what it is happening now”) – the future of the field is still subject to the basic questions: “how and where to sample.” Especially thought-provoking were Scott´s words on “the importance of people in the environmental arena” and the need for effective communication: “we have to communicate: socio-economic, hard science…, it all needs to be there because it all matters, it applies to all the world…” Amongst other aims of this environmental focus, such as living sustainably – both in urban and rural environments – climate is a major worry/fascination which is being targeted through successful collaborations between specialists in its study and statisticians. “Climate: Past to Future” by Gabi Hegerl (University of Edinburgh) and Douglas Nychka (NCAR) covered “Regional Climate – Past, Present and Future” and showed relevant examples of these successful collaborations. Susan Waldron (University of Glasgow), under the captivating speech title of “A Birefringent Future”, highlighted the need to address challenges in the communication of “multiple layers of information” too, with Statistics appearing as the solution for doing this effectively – e.g. by creating “accessible visualisation of data” and by incorporating additional environmental variation into controversial green energy studies, such as whether wind turbines create a microclimate.
Mark Hansen (Columbia University Journalism School) seconded Waldron´s arguments and called for journalists to play an important role in the challenge: “data are changing systems of power in our world and journalists. […] have to be able to think both critically as well as creatively with data”, so as to “provide much needed perspective”. Examples of this role are: terms such as “computational Journalism” already being coined, and data visualisation tools being put in place – as a fascinating example, the New York Times´ Cascade tool builds a very attractive representation of the information flow in Social Media. David Spiegelhalter (Cambridge University) also dealt with the topic in the great talk “Statistics for an Informed Public”, providing relevant examples (further explanation of the red meat example can be found here as well). To encourage the “elicitation of experts opinions (to reach a) rational consensus” and “to spend more time with regulators” came up as other useful recommendations in the presentation “Statistics, Risk and Regulation” by Michel Dacorogna (SCOR) and Paul Embrechts (ETH Zürich). In a highly connected world immersed in a data revolution, privacy becomes another major issue. Statistical challenges arising from networked data – e.g. historical interaction information modelling – were addressed by Steve Fienberg (Carnegie Mellon University) and Cynthia Dwork (Microsoft Research), who argued that “statisticians of the future will approach data and privacy (and its loss) in a fundamentally algorithmic fashion”, in doing so explicitly answering the quote by Sweeney: “Computer science got us into this mess, can computer science get us out of it?”. Michael I. Jordan (University of California-Berkeley), in “On the Computation/Statistics Interface and “Big Data””, also referred to “bring(ing) algorithmic principles more fully into contact with statistical inference“.
I would like to make a final mention of one of the questions posed by the audience during the event. When Paul Embrechts was asked about the differences between Econometrics and Statistics, a discussion followed on how cross-fertilisation between fields has happened back and forth. As a direct consequence of this contact, fields rediscover issues from other areas. For instance, Credit Risk Analysis models were mentioned as being inferred from Medical Statistics and, back in Peter Bühlmann´s talk, links were also found between Behavioural Economics and Genetics. These ideas, from my point of view, bring together the essence of Statistics, i.e. its application to and interaction with multiple disciplines as the foundations of its success. Many other fascinating topics were conveyed in the workshop but unfortunately, I cannot fit in mentions of all the talks here. I am sure you will all agree it was a fantastic event.

Featured

# Setting up your (Linux biostatistical) workstation from scratch

Facundo Muñoz, MSc in Mathematics, PhD in Statistics from the University of Valencia. He is currently a postdoc researcher at the French Institut National de la Recherche Agronomique (INRA). His main research field is spatial (Bayesian) statistics applied to environmental and biostatistical problems. He is currently working on statistical methodologies for the analysis of forest genetic resources. Being a busy biostatistician, I spend plenty of time glued to my computer. As an immediate consequence, every once in a while I need to set up my workstation from zero. Either when I change jobs (I have done it twice this year!), or when I want to update my OS version (upgrades rarely go perfectly), or when I get a new laptop.
This involves, you know, installing the OS, the main programs I need for work like R and LaTeX, some related software like a good version control system, a couple of Integrated Development Environments for coding and writing, and a dozen other ancillary tools that I use every now and then. Furthermore, I need to configure everything to run smoothly, set up my preferences, install plugins, and so on. Last time I did this manually, I spent a week setting everything up, and in the following days I always had something missing. Then I thought I should have scripted this process. Last week I set up my working environment in my new job. In a few hours I had everything up and running exactly the way I like. I spent an additional day updating the script with new software and updated versions, and solving some pending issues. I thought this script might be useful for others as well, hence this post. It is version-controlled in a Google Code repository, where you can download the main script. It is not very general, as installation details change a lot from system to system. I use Linux Mint, but I believe it should go pretty straightforwardly with any derivative of Ubuntu, or Ubuntu itself (those distros using the APT package manager). Users of other Linux branches (Arch, RedHat, SuSE, Mac’s Darwin) would need to make significant changes to the script, but the outline might still help. If you use Windows, well… don’t. Of course, you will not be using the same software as I do, nor the same preferences or configurations. But it might serve as a guide to follow line by line, changing things to suit your needs. In particular, it provides an almost-full installation (without unnecessary language packages) of the very latest LaTeX version (unlike the one in the repos), and takes care of installing it correctly. It also sets up the CRAN repository and installs the latest version of R.
The script also installs the right GDAL and Proj4 libraries, which are important in case you work with maps in R or a GIS. Finally, it installs some downloadable software like Kompozer (for web authoring), the Dropbox application, and more. It scrapes the web in order to fetch the latest and right version of each program. I hope it helps someone. And if you have alternative or complementary strategies, please share!

Featured

# FreshBiostats´ First Anniversary

So here we are. It has been a year since we started this venture. The idea of a blog came from one of our co-bloggers at the Jede II Conference in the summer of 2012. At first it sounded like a bit of a challenge, but who said fear? No doubt about it, the balance has been highly positive. We are all for sharing knowledge and resources that might be valuable for others, and from our humble perspective we sincerely hope it might have been of some use. It has certainly been so for us, both by getting insight into particular subjects when writing the different posts and by diving into new topics covered by our co-bloggers and invited posts. Twitter and Facebook have also allowed us to encourage interaction with colleagues and other bloggers, and we can now say our social and professional networks have certainly become bigger and stronger! We have found it difficult at times to juggle our jobs and PhDs with writing our weekly posts but, as we have said on several occasions, we are passionate about our work, and firmly believe that, most of the time, the line between work and fun gets blurry. As promised in a previous post, here is an infographic summarising this year of Fun & Biostatistics, enjoy! We have a very international audience with visits coming from 117 countries, and we are delighted to see that not only colleagues from our closest network are reading our entries.
Since our participation in the International Year of Statistics blog – Statistics2013 – and after being mentioned in other blogs such as R-bloggers, we have gained more visibility and some posts have become very popular (more than 1000 views for some!). Those posts focusing on R tips clearly take the cake, being the most visited. We guess they might be the most useful ones, as we are also big fans of other very practical blogs. However, we like to cover all the aspects of our profession and even sometimes deal with more controversial or philosophical subjects… We will keep inviting people to share their knowledge and will encourage colleagues to get involved in the blog. Our second-year resolution is to make an effort to turn this blog into a more interactive tool. We count on you for that! Remember you can contact us with your comments, suggestions, and enquiries at freshbiostats@gmail.com Thank you so much for being there!

Featured

# Interview with…Jorge Arribas

Jorge holds a BSc in Pharmacy from the UPV-EHU and is currently a resident trainee in Microbiology and Parasitology at the H.C.U. Lozano Blesa in Zaragoza, working also on his PhD thesis. Email: Jarribasg(at)salud(dot)aragon(dot)es 1. Could you give us some insight into your current field of research? My PhD thesis focuses on the diagnosis of Hepatitis C Virus (HCV) infection by means of core antigen determination, and I am also collaborating in a line of research focusing on new treatments for H. pylori infection, funded by a FIS grant. The former intends to perform a comparison with respect to the current technique in use for diagnosis. The latter analyses resistance to different Flavodoxin inhibitors. 2. Where does Biostatistics fit in your daily work? In both areas of research that I am working on, since they are essentially comparative studies against established techniques, and therefore require techniques to prove the significance – or lack of significance – of the improvements. 3.
Which are the techniques that you or researchers in your area use most often? Statistical techniques such as sensitivity and specificity analysis, and hypothesis testing (ANOVA, t-test). There is also a particular need in the area for techniques dealing with ordinal data. 4. As a whole, do you find Biostatistics relevant to your profession? A very important part of the speciality of Microbiology and Parasitology focuses on the research of new diagnostic methods, treatments, prevalence of antibiotic resistance, etc. Therefore, Biostatistics becomes extremely useful when comparing these novel approaches to previous ones. 5. Finally, is there any topic you would like to see covered in the blog? It would be great to see some examples of statistical applications in my area of study published. Selected publications: • J. Arribas, R. Benito, J. Gil, M.J. Gude, S. Algarate, R. Cebollada, M. Gonzalez-Dominguez, A. Garrido, F. Peiró, A. Belles, M.C. Rubio (2013). Detección del antígeno del core del VHC en el cribado de pacientes en el programa de hemodiálisis. Proceedings of the XVII Congreso de la Sociedad Española de Enfermedades Infecciosas y Microbiología Clínica (SEIMC). • M. González-Domínguez, C. Seral, C. Potel, Y. Sáenz, J. Arribas, L. Constenla, M. Álvarez, C. Torres, F.J. Castillo (2013). Genotypic and phenotypic characterisation of methicillin-resistant Staphylococcus aureus (MRSA) clones with high-level mupirocin resistance in a university hospital. Proceedings of the 23rd European Congress of Clinical Microbiology and Infectious Diseases and 28th International Congress of Chemotherapy. • M. González-Domínguez, R. Benito, J. Gil, MJ. Gude, J. Arribas, R. Cebollada, A. Garrido, MC. Rubio (2012). Screening of Trypanosoma cruzi infection with a chemiluminescent microparticle immunoassay in a Spanish University Hospital. Proceedings of the 22nd European Congress of Clinical Microbiology and Infectious Diseases and 27th International Congress of Chemotherapy.
Featured

# FreshBiostats birthday and September-born famous statisticians

On the occasion of the 1st birthday of FreshBiostats, we want to remember some of the great statisticians born in September who have contributed to the “joy of (bio)stats”.

• Gerolamo Cardano (Pavia, 24 September 1501 – 21 September 1576): first systematic treatment of probability
• Caspar Neumann (Breslau, 14 September 1648 – 27 January 1715): first mortality rates table
• Johann Peter Süssmilch (Zehlendorf, 3 September 1707 – 22 March 1767): demographic data and socio-economic analysis
• Georges Louis Leclerc (Buffon) (Montbard, 7 September 1707 – Paris, 16 April 1788): first example in “geometric probability” and a body of experimental and theoretical work in demography
• Adrien-Marie Legendre (Paris, 18 September 1752 – 10 January 1833): development of the least squares method
• William Playfair (Liff, 22 September 1759 – London, 11 February 1823): considered the founder of graphical methods of statistics (line graph, bar chart, pie chart, and circle graph)
• William Stanley Jevons (Liverpool, 1 September 1835 – Hastings, 13 August 1882): statistical atlas – graphical representations of time series
• Anders Nicolai Kiaer (Drammen, 15 September 1838 – Oslo, 16 April 1919): representative sample
• Charles Edward Spearman (London, 10 September 1863 – 17 September 1945): pioneer of factor analysis and Spearman´s rank correlation coefficient
• Anderson Gray McKendrick (Edinburgh, 8 September 1876 – 30 May 1943): several discoveries in stochastic processes and collaborator in the path-breaking work on the deterministic model for the general epidemic
• Maurice Fréchet (Maligny, 2 September 1878 – Paris, 4 June 1973): contributions in econometrics and spatial statistics
• Paul Lévy (15 September 1886 – 15 December 1971): several contributions to probability theory
• Frank Wilcoxon (County Cork, 2 September 1892 – Tallahassee, 18 November 1965): Wilcoxon rank-sum test, Wilcoxon signed-rank test
• Mikhailo Pylypovych Kravchuk (Chovnytsia, 27 September 1892 – Magadan, 9 March 1942): Krawtchouk polynomials, a system of polynomials orthonormal with respect to the binomial distribution
• Harald Cramér (Stockholm, 25 September 1893 – 5 October 1985): important statistical contributions to the distribution of primes and twin primes
• Hilda Geiringer (Vienna, 28 September 1893 – California, 22 March 1973): one of the pioneers of disciplines such as molecular genetics, genomics, bioinformatics,…
• Harold Hotelling (Fulda, 29 September 1895 – Chapel Hill, 26 December 1973): Hotelling´s T-squared distribution and canonical correlation
• David van Dantzig (Rotterdam, 23 September 1900 – Amsterdam, 22 July 1959): focus on probability, emphasizing the applicability to hypothesis testing
• Maurice Kendall (Kettering, 6 September 1907 – London, 29 March 1983): random number generation and Kendall´s tau
• Pao-Lu Hsu (Peking, 1 September 1910 – Peking, 18 December 1970): founder of the newly formed discipline of statistics and probability in China

It is certainly difficult to think of the field without their contributions. They are all a great inspiration to keep on learning and working!! Note: you can find other interesting dates here. Update: and Significance´s timeline of statistics here. Any author you consider particularly relevant? Any other suggestions?

Featured

# Infographics in Biostatistics

Although the history of Infographics according to Wikipedia does not seem to mention Fritz Kahn as one of the pioneers of this technique, I would like to start this post by mentioning one of the didactic graphical representations of this Jewish German doctor, who was highly reputed as a popular science writer of his time.
Apart from his fascinating views of the human body in the form of machines and industrial processes, I am particularly attracted by his illustration below, summarising the evolution of life on Earth as a clock in which the history of humans would take no more than a few seconds… Image extracted from the printed version of the article “Fritz Kahn, un genio olvidado”, published in El País on Sunday 1st of September 2013. What could be understood by some as a naive simplification of matters requires, in my opinion, a great deal of scientific knowledge, and it is a fantastic effort to communicate the science behind very complex mechanisms. This and more modern infographic forms of visualisation represent an opportunity for statisticians – and more specifically biostatisticians – to make our field approachable and understandable for the wider public. Areas such as Public Health (see here), Cancer research (find examples here and here), and Drug development (see here) are already using them, so we should not be ashamed to make these “less rigorous” graphical representations an important tool in our work. Note: there are plenty of resources available online to design a nice infographic in R. For a quick peek into how to create easy pictograms, check out this entry in Robert Grant´s stats blog. Also, the wordcloud R package will help you visualise the main ideas in texts… We will soon show a practical example of these representations in this blog, stay tuned!

Featured

# An example of Principal Components Analysis

The last post that I published was about two techniques of Multivariate Analysis: Principal Component Analysis (PCA) and Correspondence Analysis (CA). In this post I will show a practical example of PCA with R. Let’s go! We are going to work with Fisher’s Iris Data, available in the package “datasets”. This data, collected over several years by Edgar Anderson, was used to show that these measurements could be used to differentiate between species of irises.
That data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica. You can load the Iris data and examine this data frame with:

data(iris)
str(iris); summary(iris[1:4])
pairs(iris[1:4], main="Iris Data", pch=19, col=as.numeric(iris$Species)+1)
mtext("Type of iris species: red-> setosa; green-> versicolor; blue-> virginica", 1, line=3.7, cex=.8)


We will use the “prcomp” R function to carry out the analysis, which is similar to the “princomp” function.

As we said in the last post, PCA is used to create linear combinations of the original data that capture as much of the information in the original data as possible. Before starting with PCA, it is convenient to mention some particularities of this methodology.

In the prcomp function we need to indicate whether the principal components are calculated through the correlation matrix (with standardized data) or the covariance matrix (with raw data). We should standardize our variables when they have different units or very different variances. If they are in the same units, both alternatives are possible. In our example all variables are measured in centimetres, but we will use the correlation matrix for simplicity’s sake.
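To see what this choice amounts to in prcomp itself (a small check of my own, not part of the original post): the `scale.` argument switches between the two matrices, and with standardised variables each contributes variance 1, so the component variances sum to the number of variables.

```r
# Correlation-matrix PCA vs covariance-matrix PCA on the iris measurements
data(iris)

# scale. = TRUE standardises the variables first (correlation matrix)
pca_cor <- prcomp(iris[, 1:4], scale. = TRUE)
# scale. = FALSE (the default) works on the raw variables (covariance matrix)
pca_cov <- prcomp(iris[, 1:4], scale. = FALSE)

# With the correlation matrix, the component variances sum to 4,
# the number of variables
sum(pca_cor$sdev^2)
```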

#To examine variability of all numeric variables
sapply(iris[1:4],var)
range(sapply(iris[1:4],var))
# maybe this range of variability is big in this context.
#Thus, we will use the correlation matrix
#For this, we must standardize our variables with scale() function:
iris.stand <- as.data.frame(scale(iris[,1:4]))
sapply(iris.stand,sd) #now, standard deviations are 1


Now we apply the prcomp() function to calculate the principal components:

#If we use prcomp() function, we indicate 'scale=TRUE' to use correlation matrix
pca <- prcomp(iris.stand,scale=T)
#it is just the same as: prcomp(iris[,1:4],scale=T) and prcomp(iris.stand)
#similar with princomp(): princomp(iris.stand, cor=T)
pca
summary(pca)
#This gives us the standard deviation of each component, and the proportion of variance explained by each component.
#The standard deviations are stored in pca$sdev (see 'str(pca)'):
pca$sdev
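As a small sketch of my own (not from the original post), the proportion of variance explained reported by summary(pca) can be recomputed directly from those standard deviations:

```r
# Recompute the proportion of variance explained from pca$sdev
data(iris)
iris.stand <- as.data.frame(scale(iris[, 1:4]))
pca <- prcomp(iris.stand)

# Component variances are the squared standard deviations
prop_var <- pca$sdev^2 / sum(pca$sdev^2)

# Cumulative proportion: the first two components capture
# most of the variability in the iris measurements
cumsum(prop_var)
```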
data one;
input gender $ age weight;
cards;
F 13 43
M 16 62
F 19 140
F 20 120
M 15 60
M 18 95
F 22 75
;
run;

The SAS code shown above will create the desired data set called “one”. As the gender variable – the first variable – is categorical, its values are “M” for MALE and “F” for FEMALE. The dollar sign ($) is used when you have a text variable rather than a numerical variable (i.e., gender coded as M, F rather than as 1 denoting male and 2 denoting female). (b) Statistical analysis For the performance of a basic descriptive analysis – percentage and mean computations – the following procedures should be executed: · For frequency and percentage computation:

proc freq data = one;
tables gender;
run;

· Descriptive analysis for continuous variables:

proc means data = one n mean std;
var age weight;
run;

As can be observed, it is really easy to remember which statistical procedure to use according to the type of variable: the proc freq procedure for categorical data and proc means for continuous data. (c) Output Another important issue is the way the results are shown. The SAS statistical package has improved this section. Before the release of version 9.0 – I think so! – one had to copy and paste all the results from the output window into a Word document in order to get a proper, saved version of the results. From version 9.0, things have changed: all results can be delivered in PDF, RTF or even HTML format. As a SAS user, I can say that this has been a good idea, since not only do I no longer have to waste lots of time doing copy-paste, but the not-so-useful results can also be left unprinted. That has been a great leap!!! For example, if you want to have the results in PDF format, you should follow the next instructions:

ods pdf;
proc means data = one n mean std;
var age weight;
run;
ods pdf close;

This code generates a PDF file which shows the mean and standard deviation of the age and weight variables of the “one” data set.
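As an aside of my own (not part of the original post), readers coming from R may find it useful to see a rough equivalent of the proc freq and proc means steps above, using the same small data set:

```r
# Rough R equivalent of the SAS descriptive steps, with the same values
# as in the cards data above
one <- data.frame(gender = c("F", "M", "F", "F", "M", "M", "F"),
                  age    = c(13, 16, 19, 20, 15, 18, 22),
                  weight = c(43, 62, 140, 120, 60, 95, 75))

# proc freq: frequency table of the categorical variable
table(one$gender)

# proc means: n, mean and standard deviation of the continuous variables
sapply(one[, c("age", "weight")],
       function(x) c(n = length(x), mean = mean(x), sd = sd(x)))
```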
Because of its capabilities, this software package is used in many disciplines, including the medical sciences, biological sciences, and social sciences. Knowing the SAS programming language will likely help you both in your current class or research and in getting a job. If you want to go further in SAS programming knowledge, The Little SAS Book by Lora Delwiche and Susan Slaughter is an excellent resource for getting started using SAS. I also used it when I started learning SAS. Now it is your turn!! Have a nice holiday!

Featured

# Mixed Models in R: lme4, nlme, or both?

The topic of Mixed Models is an old friend of this blog, but today I want to focus on the R code for these models. Amongst all the packages that deal with linear mixed models in R (see lmm, ASReml, MCMCglmm, glmmADMB,…), lme4 by Bates, Maechler and Bolker, and nlme by Pinheiro and Bates are probably the most commonly used – in the frequentist arena – with their respective main functions lmer and lme. I am still unsure as to which one I would choose – if I had to – but I am sharing here a summary of some of their capabilities, in case it can be of help: Model specification I will be using for all of the following examples the balanced dataset Orthodont from the package nlme, which reflects the change in an orthodontic measurement over time for a sample of 27 children (see Fig. 1). Fig. 1. Spaghetti plots. Distance vs age by gender – dataset Orthodont. For simplicity´s sake, I will consider the following initial models with a simple random factor (please see ref. [3] for centering and further analysis):

model_lmer<-lmer(distance~age+Sex+(1|Subject),data=Orthodont)
model_lme<-lme(distance~age+Sex, random=~1|Subject,data=Orthodont)

Tests • lmer The results for t-tests and F-tests based on Restricted Maximum Likelihood (REML) can be found by using the following lines of code (you can add REML=FALSE to change this default setting):

summary(model_lmer)

Fixed effects: Estimate Std.
Error t value
(Intercept) 17.70671 0.83391 21.233
age 0.66019 0.06161 10.716
SexFemale -2.32102 0.76139 -3.048

(Please notice that the reference category of Sex can be changed by using relevel).

anova(model_lmer)

Analysis of Variance Table
Df Sum Sq Mean Sq F value
age 1 235.356 235.356 114.8383
Sex 1 19.034 19.034 9.2875

• lme Conditional t-tests and F-tests are used for assessing the significance of terms in the fixed effects in lme. Both conditional F- and t-test results are based on the REML conditional estimate of the variance. This is the default, but we can specify method="ML" for Maximum Likelihood estimates. For the results of the conditional t-tests testing the marginal significance of each fixed-effect coefficient, given the other fixed effects in the model, we can use:

summary(model_lme)

Fixed effects: distance ~ age + Sex
Value Std.Error DF t-value p-value
(Intercept) 17.706713 0.8339225 80 21.233044 0.0000
age 0.660185 0.0616059 80 10.716263 0.0000
SexFemale -2.321023 0.7614168 25 -3.048294 0.0054

For the results of the conditional F-tests testing the significance of terms in the fixed-effects model (sequentially by default), we can use:

anova(model_lme)

numDF denDF F-value p-value
(Intercept) 1 80 4123.156 <.0001
age 1 80 114.838 <.0001
Sex 1 25 9.292 0.0054

(The argument type would also allow us to specify marginal F-tests). These conditional tests for fixed-effects terms require denominator degrees of freedom, which will be the focus of the next section. Degrees of freedom • lme The denominator degrees of freedom correspond to “the classical decomposition of degrees of freedom in balanced, multilevel ANOVA designs” [3]. It is worth noticing that the values for these degrees of freedom do not match those provided by other software procedures such as SAS PROC MIXED (see discussions on the topic here and here).
In addition to the denominator degrees of freedom mentioned above, conditional F-tests also require numerator degrees of freedom defined by the term (see output from the previous section). • lmer A good explanation regarding the reporting of degrees of freedom in lmer is given by the author of the package in this article (page 28). p-values • lme reports p-values (see previous output), whereas • lmer doesn’t, but this has been justified by Bates. Random effects • lme allows for nested random effects in a very straightforward way (random=~1|a/b, where factor b is nested in a). Crossed random effects, on the other hand, can be dealt with through “a combination of pdBlocked and pdIdent objects” [3]. • Nested random effects can again be easily modelled in lmer (+(1|a/b)). Crossed random effects are handled in an easier way than in lme (+(1|a)+(1|b)). You can find further explanations in [2]. Random effects plots Two different approaches to plotting the random effects can be obtained through the following lines of code: • lme

plot(ranef(model_lme))

• lmer

qqmath(ranef(model_lmer))

Fig. 2. Random effects plots for model_lme and model_lmer. Residuals plots • lme allows plotting the residuals in the following ways:

res_lme=residuals(model_lme)
plot(res_lme)
qqnorm(res_lme)
qqline(res_lme)
plot(model_lme)

Fig. 3. Residual plots for model_lme. • We can also produce the first two graphs in lmer, but the last line of code does not seem to work with this function. Correlation structure • We can easily incorporate correlation structures in lme. Mostly used for temporal correlation are corAR1, corCAR1 (autoregressive and continuous autoregressive correlation structures of order 1), and corCompSymm (compound symmetry).
model_lme <- lme(distance ~ age + factor(Sex), random = ~ 1 | Subject, cor = corAR1(0.6, form = ~ 1 | Subject), data = Orthodont)

Correlation Structure: AR(1)
Formula: ~1 | Subject
Parameter estimate(s):
Phi
0.05849311

Further structures available can be found in the help for corClasses.

• There is no argument in lmer for doing this, but guidance on how to incorporate the structures can be found here.

Heteroscedasticity

• lme allows you to model heteroscedasticity using the varFunc object, but
• it is not covered by lmer.

Although some other differences could be mentioned, future posts will cover those matters. The following books have been found extremely useful:

[1] Badiella, L., Sánchez, J.A. “Modelos mixtos con R versión 0.3” (from the course “Modelos Mixtos utilizando R”)
[2] Bates, D. (2010). lme4: Mixed-Effects Modelling with R. Springer
[3] Pinheiro, J.C. and Bates, D.M. (2000). Mixed-Effects Models in S and S-Plus. Springer
[4] Zuur, A.F. et al. (2009). Mixed Effects Models and Extensions in Ecology with R. Springer

I will be more than happy to read your comments and tips!

Featured

# Side effects of open health data

Recent improvements in the technology to record data have coincided with calls for making these data freely available. Health-related studies are a particular case in point. At the forefront of these changes, reputable publications have taken measures to set transparency standards. Since January, for example, the British Medical Journal “will no longer publish research papers on drug trials unless all the clinical data on what happened to the patients is made available to anyone who wants to see it.” (Significance magazine, Volume 9 Issue 6, December 2012) In a sector that has often been accused of secrecy, GlaxoSmithKline are also engaged in this spirit of openness.
They recently announced that they would make available “anonymised patient-level data from their clinical trials” to researchers “with a reasonable scientific question, a protocol, and a commitment from the researchers to publish their results” (ibid).

Fig 1. Death rates per 1000 in Virginia (USA) in 1940, VADeaths R dataset (for illustration only)

However, in the past few weeks two stories seem to challenge this trend towards greater transparency. At the same time as rumours grow in the UK of cuts to the publication of well-being data (The Guardian, 10th of July 2013), controversy has arisen regarding the recently released individual performance records of National Health Service (NHS) vascular surgeons (BBC News, 28th of June 2013).

While the measure has been welcomed by some sectors of the general public, there have been important criticisms from the medical side. Several doctors within the speciality, with perfectly satisfactory records, are refusing to agree to the metric. The argument is that differences in the types and numbers of procedures, coupled with the variability of prognoses, make published indicators such as death rates misleading to patients. In general, calls have been made for further research into performance indicators that ensure the information provided to end-users is efficacious.
As an example of this, back in 2011, when the first attempts to publicise this kind of information started, Significance magazine (Volume 8 Issue 3, September 2011) reported failure to agree on “which indicators to use” as one of the causes of the lack of success, and also mentioned “discussions with the Royal College of General Practitioners to establish the most meaningful set.”

Tensions between opening up areas of genuine interest to the widest audience and ensuring that there are no unintended side effects are a societal challenge in which statisticians can play a vital role: sometimes numbers cannot speak for themselves, and appropriate interpretation might be required to avoid wrong conclusions. This becomes particularly important when dealing with health issues.

Note: in Spain, it would seem that there is still much work to be done in terms of open data. A PricewaterhouseCoopers report (pp. 120-131) highlights the issue as one of the ten hot topics in the Spanish Health System for 2013, and welcomes the creation of the website www.datos.gob.es as one of the first steps towards openness in this and other sectors.

What are your thoughts on this issue? Are there any similar measures being taken in your country or organisation?

Featured

# Interview with…Laetitia Teixeira

Laetitia is a graduate in Applied Mathematics and holds a Master’s degree in Applied Statistics and Modelling from the University of Porto, Portugal. At present, she is a PhD student in Applied Mathematics and works in the Research Unit UNIFAI (Institute of Biomedical Sciences Abel Salazar, University of Porto, Portugal).

1. Why do you like Biostatistics?

Biostatistics allows for the application of statistical theory to various areas of research practice, and the opportunity to work in several fields. Statistics and medicine are two areas of great interest to me, and biostatistics allows me to work in both.

2. Could you give us some insight into your current field of research?
My PhD work focuses on survival analysis in the presence of competing risks. All the practical work is based on end-stage renal disease patients with peritoneal dialysis as renal function replacement therapy. We explore several statistical approaches, such as regression models taking competing risks into account, multistate models, and joint models for longitudinal and competing risks data. Using these approaches, we can give more and better information about disease progression, helping clinicians in the evaluation of patients and in treatment planning. Alongside my PhD, I am a research fellow at UNIFAI/ICBAS-UP, a Research Unit specialized in ageing and health.

3. Which are, in your opinion, the main advantages of being a researcher?

The opportunity to work in several areas with multidisciplinary teams.

4. What do you think of the situation of young biostatisticians in Portugal?

In Portugal, biostatisticians are mostly present in higher education institutions. Some public and private enterprises have been taking on young biostatisticians, although in very limited numbers. Some colleagues have gone to other European countries, where they have found better opportunities in this area.

5. What would be the 3 main characteristics or skills you would use to describe a good biostatistician?

Interest in research, versatility and good communication.

6. Which do you think are the main qualities of a good mentor?

Being a motivator, interested in research, dedicated and a good communicator.

7. Finally, is there any topic you would like to see covered in the blog?

A list of working groups organized by research themes. Such a list would help young biostatisticians find people by working area, and would allow students and researchers to create further networks.

Selected publications:

• Teixeira, L., Rodrigues, A., Carvalho, M.J., Cabrita, A. & Mendonça, D. (2013). Modeling Competing Risks in Nephrology Research: An Example in Peritoneal Dialysis.
BMC Nephrology 2013, 14:110, doi:10.1186/1471-2369-14-110
• Cotovio, P., Rocha, A., Carvalho, M.J., Teixeira, L., Mendonça, D., Cabrita, A., & Rodrigues, A. (2013). Better Outcomes of Peritoneal Dialysis in Diabetic Patients in Spite of Risk of Loss of Autonomy for Hemodialysis. Accepted – Peritoneal Dialysis International.
• Rocha, A., Rodrigues, A., Teixeira, L., Carvalho, M. J., Mendonça, D., & Cabrita, A. (2012). Temporal Trends in Peritonitis Rates, Microbiology and Outcomes: The Major Clinical Complication of Peritoneal Dialysis. Blood Purification, 33(4), 284-291.

Featured

# Between descriptive analysis and decision making

When faced with a statistical analysis task, we usually begin with a simple description of our data that allows us to analyze and interpret our variables, with the aim of making a decision on some hypotheses formulated at the start of the study. This post deals with some statistical methods that are regarded as exploration techniques, but that go further than the simple, usual descriptives. It will focus on correspondence analysis (CA) and principal components analysis (PCA), both central to Multivariate Analysis.

PCA and CA are usually applied to high-dimensional datasets, with the principal objective of reducing the dimensions of the data. Although these methods have their own particularities, both aim to explain latent variables in the problem through the observed data.

PCA is widely used to capture the essential patterns of big datasets. In high-dimensional data it is sometimes difficult for researchers to extract interesting features, and one way to solve this is to reduce the dimensionality at the expense of losing some information. It works by creating new uncorrelated variables $Y_i$ (the principal components) through linear combinations of the original variables $X_k$ (in general correlated):
$Y_i = \alpha_1X_1 + \alpha_2X_2 + \ldots + \alpha_jX_j$

These principal components collect all the information of the original variables, and the goal is to select a few of them while preserving as much of the data variance as possible.

CA: unlike PCA, this methodology is applied to categorical data (without calculating linear combinations) as a procedure to analyze contingency tables. CA allows us to describe the relation between two nominal variables, as well as the relation between their levels, on a Cartesian axis. The extension of correspondence analysis to many categorical variables is called multiple correspondence analysis.

The applications of PCA and CA are wide and varied, in fields such as biology, ecology, social sciences, psychology, image processing…, in which the number of variables is big. As we have said before, in that situation PCA and CA provide us with a method to extract latent variables and intrinsic population characteristics that have not been observed, so that we can think of them as a hypothesis generation system.

With the ‘ca’ (‘mjca’) and ‘princomp’ functions in R we can apply Simple (Multiple) Correspondence Analysis and Principal Components Analysis to our data. The following figure illustrates a typical PCA graphic, representing the first two components.

Here I have briefly commented on a few aspects of two procedures for describing large datasets. In following posts I will try to work through an example using R. Meanwhile, you can try and play with the ‘ca’ and ‘princomp’ functions.

Featured

# How to manage longitudinal data with R?

During the time that this blog has been running, several posts about longitudinal data have been published. However, we have not yet talked about how to deal with it in R. As we have discussed in other posts, longitudinal data gather information from a group of study subjects over time (repeated measures).
When the number of measurements is the same for all subjects and these measurements are equidistant in time, we say the data are balanced; otherwise, the data are called unbalanced. When working with either format in R, the joineR package allows us to adapt our data. If our data are balanced, we can move from one format to the other according to the analysis we are interested in, simply by:

# generate a (balanced) data set
simul <- data.frame(id = 1:3, Y.1 = rnorm(3), Y.2 = rnorm(3), age = runif(3,0,18))
# move it to an unbalanced format
simul.unbal <- to.unbalanced(simul, id.col = 1, times = c(1,2), Y.col = 2:3, other.col = 4)
# return data to a balanced format
simul.bal <- to.balanced(simul.unbal, id.col = "id", time.col = "time", Y.col = c("Y.1"), other.col = 4)

Once we have our data in an appropriate format, one of the first descriptive analyses to carry out is the empirical longitudinal variogram, which allows us to check whether within-subject observations are related. In order to do that, we need the unbalanced data format, and we can obtain the variogram very easily with the joineR variogram function (although expect it to be a bit slow if you have a large dataset). As an example, we will load a data set included in the Applied Longitudinal Data Analysis book by Judith D. Singer and John B. Willett and calculate the empirical variogram:

# read in data set (tolerance data from ALDA book)
tolerance <- read.csv("http://www.ats.ucla.edu/stat/r/examples/alda/data/tolerance1_pp.txt")
vgm <- variogram(indv=tolerance$id, time=tolerance$time, Y=tolerance$tolerance)
plot(vgm)


The package also provides further plotting functions, as well as the joint analysis of longitudinal and survival data using random effects joint models. Certainly a very interesting package for those who deal with this type of data, or who are interested in starting to work with it.

Try it and tell us your experience.

Featured

# Guest blogger: Tales of R

Collaboration between statisticians and researchers

In the last few days I have been thinking about how researchers could collaborate efficiently with their experts in statistics. With the increasing complexity of science, exchanging information can be crucial to getting the best results. But what happens when a researcher doesn’t give all the information a statistician needs?

When someone asks for help in this field – as a clinician, I need it often – many biostatisticians make sure that some basic points are met in a research study prior to the actual analysis, and I think they should ask for them as soon as possible:

• Is the research question sensible, and supported by the literature in any way? A good research idea should be accompanied by a short review with pros, cons and all the important references, which could guide you to the most appropriate or most used statistical methods in the field. Ask for it, read it. If the study has a flaw at this point, it’s easier to correct it now than later.
• Is the research question defined and detailed at the end of the review? Does it have a main objective? Do they plan further analyses depending on the results? Do researchers give enough information for sample size calculation? With these points you can avoid getting stuck in the middle of a study for a long time. The scientific method is always a good guide if things go wrong.
• Data? Ok, they have planned something taking into account their available knowledge. But how about the resources? Can they collect all the data they need? If not, how can the design be adapted? Often, they have to change something in the research question, and start again. But in the worst case it takes much less time than not doing things right from the beginning.

Have they – researchers, clinicians, whoever – met all three points above? Then your chances of helping them answer the question will increase, and even the time to the answer may decrease substantially.

I hope this has been useful!

You can find my blog “Tales of R…messing around with free code” here.

Featured

# Sharing statistical analysis and results through web applications

I have to admit I am not completely comfortable with the RStudio IDE yet. However, RStudio’s projects are of great interest, and the new package Shiny – released last November – is no exception.

Shiny allows you to build web applications in R, so anyone can access your analysis results through an interactive web browser interface. As they state on their website: “…your users choose input parameters using friendly controls like sliders, drop-downs, and text fields. Easily incorporate any number of outputs like plots, tables, and summaries…”

The application is structured in two components: a user interface script named ui.R, and a server script, server.R. The first deals with the input and output format; the second contains the code to run the analysis.
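As a minimal, hypothetical sketch of this two-script structure (the app, the input name "bins" and the output name "hist" are my own illustrative choices, not the SNPassoc example discussed below):

```r
# ui.R — declares the inputs (a slider) and the output placeholders (a plot)
library(shiny)
shinyUI(pageWithSidebar(
  headerPanel("Histogram demo"),
  sidebarPanel(
    sliderInput("bins", "Number of bins:", min = 5, max = 50, value = 20)
  ),
  mainPanel(plotOutput("hist"))
))

# server.R — builds the outputs from the inputs; re-runs when the slider moves
library(shiny)
shinyServer(function(input, output) {
  output$hist <- renderPlot({
    hist(faithful$eruptions, breaks = input$bins,
         col = "grey", xlab = "Eruption time (min)", main = "")
  })
})
```

Saving the two chunks as ui.R and server.R in a folder and calling runApp("folder") should launch the application in the browser.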

I made a quick trial following the tutorial, and my (very simple) web app was ready in no time. It is based on some of the examples in the vignette of SNPassoc, an R package designed to perform genetic association studies. In this application, you can check the summary for a small set of SNPs and get both a plot showing the frequencies and the results of the association study for a given SNP.

By using Shiny, you can run your applications locally or share them with other users so they can run the applications on their own computers. There are several options for distributing your apps; you can check all of them here. In this case, the app can be accessed through a GitHub Gist. Once the Shiny package is installed (along with the SNPassoc package used in this example), you just have to run the following code:

library(shiny)
shiny::runGist('4e5e618431a59abe692b')


In my opinion this tool has great potential for sharing analysis results through an interactive and friendly interface. It might replace files containing a lot of graphics and tables, while saving time both for the data analyst and for the end user. Have you tried it? Do you want to share your apps with us?

Featured

# Interview with… Guillermo Vinué Visús

Guillermo Vinué Visús completed his degree in Mathematics in 2008, granted by the Universitat de València (Spain). He also holds a Master’s degree in Biostatistics from the same university. After working at the Drug Research Center of the Santa Creu i Sant Pau Hospital in Barcelona (Spain) for a year, he is currently a PhD student in the Department of Statistics and Operations Research at the Universitat de València. His doctoral thesis focuses on Statistics applied to Anthropometry.

1. Why do you like Biostatistics?

I like Biostatistics because it allows me to apply Maths to different real life problems.

2. Could you give us some insight in your current field of research?

I am working on  a research project aiming to develop statistical methodologies to deal with anthropometric data, in order to tackle some statistical problems related to Anthropometry and Ergonomics.

3. Which are, in your opinion, the main advantages of being a researcher?

The main advantage is the possibility of learning a bit more every day. Research is a continuous learning process.

4. Your whole professional experience has been within the public sector and  the University. How do you see the present and future of research in the Spanish public sector?

The current economic difficulties have unfortunately meant that the government budget for scientific research is more and more limited, so I am concerned about both the present and the future of Spanish public research.

5. What do you think of the situation of young biostatisticians in Spain?

Neither better nor worse than other young people in the country. Nowadays, I guess the best way to make progress is to move abroad.

6. What would be the 3 main characteristics or skills you would use to describe a good biostatistician?

Enthusiasm, effort and a little bit of R knowledge.

7. Which do you think are the main qualities of a good mentor?

To be attentive and available when needed.

Selected publications:

• Ibañez M. V., Vinué G., Alemany S., Simó A., Epifanio I., Domingo J., Ayala G., “Apparel sizing using trimmed PAM and OWA operators”, Expert Systems with Applications 39, 10512-10520, 2012, http://dx.doi.org/10.1016/j.eswa.2012.02.127
• Epifanio I., Vinué G., Alemany S., “Archetypal analysis: Contributions for estimating boundary cases in multivariate accommodation problem”, Computers & Industrial Engineering 64, 757- 765, 2013, http://dx.doi.org/10.1016/j.cie.2012.12.011
Featured

# P-splines for longitudinal data

Ipek Guler is an intern at Biostatech, Advice, Training and Innovation in Biostatistics, S.L. She has a MSc in Statistical Techniques from the University of Santiago de Compostela (Spain) and a BSc in Statistics from Gazi University (Turkey).

Contact Ipek

In previous posts about longitudinal data by Urko Agirre and Hèctor Perpiñán, mention has already been made of the Mixed Models methodology.

Usually in medical and biological studies, the designed experiments include repeated measures, used to investigate changes over a period of time for each subject in the study. Multiple measurements are obtained for each individual at different times. This type of longitudinal data can be analyzed with a mixed effects model (Pinheiro and Bates, 2000), which allows modeling and analysis of between- and within-individual variation. An example of such data is growth curves measured over time.

In many practical situations, using traditional parametric regression techniques is not appropriate to model such curves. Misleading conclusions may be reached if the time effect is incorrectly specified.

Durban et al. (2005) presented flexible mixed models for fitting subject-specific curves and factor-by-curve interactions for longitudinal data, in which the individual and interaction curves are modeled as penalized splines (P-splines), and the model’s estimation is based on the mixed model representation of P-splines (Currie and Durban, 2002).

This representation is quite interesting because it allows us to use the methodology and the wide range of software available for mixed models (e.g., the nlme and lme4 packages in R), and it also comes equipped with an automatic smoothing parameter choice, corresponding to maximum likelihood (ML) and/or restricted maximum likelihood (REML) estimation of the variance components. With this representation, the smoothing parameter becomes the ratio between the variance of the residuals and the variance of the random effects.

First of all, let’s define a linear mixed model for longitudinal data.

In matrix notation:

$y=X\beta+Zu+\varepsilon$

• $y$ is the vector of observed responses,

$\left(\begin{array}{c}y_{1}\\\vdots\\ y_{N}\end{array}\right)=\left(\begin{array}{c}y_{11}\\y_{12}\\\vdots\\ y_{1n_{1}}\\ y_{21}\\\vdots\\ y_{Nn_{N}}\end{array}\right)$

• $\beta$ is the fixed effects parameters vector,
•  $\varepsilon=\left(\begin{array}{cc}\varepsilon_{1}\\\varepsilon_{2}\\\vdots\\ \varepsilon_{N}\end{array}\right)$  is the residual vector,
•  $u=\left[\begin{array}{cc}u_{1}\\u_{2}\\\vdots\\ u_{N}\end{array}\right]$ denotes unknown individual effects.
• $Z=\left[\begin{array}{ccc}Z_{1}&\ldots&0\\\vdots&\ddots&\vdots\\ 0&\ldots&Z_{N}\end{array}\right]$ is a known design matrix linking $u$ to $y$,
• $X=\left[\begin{array}{c}X_{1}\\X_{2}\\\vdots\\ X_{N}\end{array}\right]$ is the design matrix for the fixed effects.

$u_{i}$, $i=1,\ldots,N$, are assumed to be $N(0,G)$, independently of each other and of the $\varepsilon_{i}$. $\varepsilon_{i}$ is distributed as $N(0,R_{i})$, where $R_{i}$ and $G$ are positive definite covariance matrices.

Consider a flexible regression model,

$y_{i}=f(x_{i})+\varepsilon_{i}$

$\varepsilon_{i}\sim N(0,\sigma^2)$

where $y_{i}$ is the response for observation $i=1,\ldots,N$ and $f(\cdot)$ is a smooth function of the covariate $x$. We represent this function as a linear combination of $d$ known basis functions $B_{j}$,

and reformulate it as:

$y=B\theta+\varepsilon$

$\varepsilon\sim N(0,\sigma^2I)$

Depending on the basis used for the P-splines, $X$ and $Z$ have the following forms:

• Truncated polynomials

$X=\left[1,x,\ldots,x^p\right]$

$Z=\left[(x_{i}-\kappa_{k})^p_+\right]$, $1 \leqslant i \leqslant n$, $1\leqslant k \leqslant K$, where $\kappa_{1},\ldots,\kappa_{K}$ are the knots.

• B-splines

$X=[1:x]$

$Z=BU\Sigma^{-1/2}$

where $U$ is the matrix containing the eigenvectors of the singular value decomposition of the penalty matrix $P=D^{\prime}D$, and $\Sigma$ is a diagonal matrix containing the eigenvalues, with $q$ null eigenvalues (Lee, 2010).

Therefore the model becomes:

$y=X\beta+Zu+\varepsilon$

$u\sim N(0,\sigma^2_{u}I_{c-2})$

and $\varepsilon\sim N(0,\sigma^2I)$

where $c$ is the number of columns of the basis $B$, and the smoothing parameter becomes $\lambda=\frac{\sigma^2}{\sigma^2_{u}}$ (Durban et al., 2005).

With this mixed model representation of P-splines, we obtain more flexible mixed models, which allow a simple implementation of otherwise complicated models. In future posts, we will talk more about these flexible models, with individual curves and factor-by-curve interactions, as also described in Durban et al. (2005).
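To make the representation concrete, here is a minimal sketch in R: a penalized spline fitted as a mixed model with nlme, using the truncated-line basis ($p=1$) from above. The simulated data, the knot choice and all object names are illustrative assumptions of mine, not part of the references cited:

```r
library(nlme)

# simulated data: a smooth signal plus noise
set.seed(1)
n <- 100
x <- seq(0, 1, length.out = n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)

# truncated-line basis: Z[i, k] = (x_i - kappa_k)_+  with K interior knots
K <- 10
kappa <- seq(min(x), max(x), length.out = K + 2)[-c(1, K + 2)]
Z <- outer(x, kappa, function(x, k) pmax(x - k, 0))

# one artificial group containing all observations: the spline
# coefficients become i.i.d. random effects, i.e. a pdIdent structure
dummy <- factor(rep(1, n))
fit <- lme(y ~ x, random = list(dummy = pdIdent(~ Z - 1)))

# the smoothing parameter is the ratio of the two estimated variances
sigma2   <- fit$sigma^2                           # residual variance
sigma2.u <- as.numeric(VarCorr(fit)[1, "Variance"])  # random-effect variance
lambda   <- sigma2 / sigma2.u
```

Here REML (lme's default) chooses the amount of smoothing automatically: a large `lambda` means a heavier penalty and a smoother fitted curve, recovering exactly the ratio $\sigma^2/\sigma^2_u$ discussed above.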

Featured

# SPSS, SAS or R…which one to choose?

Performing statistical analyses by hand in the era of new technologies would seem crazy. Nowadays, there are three main statistical programs for doing statistics: IBM SPSS, SAS and R, as can be read in a more extensive post on this site. Sometimes biostatisticians need to use more than one package to carry out their analyses. This means that users of these programs have to move from one environment to another, from front-end to back-end, using different wizards and graphical interfaces, wasting on most occasions a significant amount of time. Because of that, in this post I would like to address the differences between the aforementioned programs, pointing out the advantages and the most suitable usage of each from my personal point of view.

For a good choice of a statistical program, one should take into account several issues such as the following:

1. Habit: It refers to how the biostatistician has gotten used to one particular statistical program and does not need to learn something new.

2. Distrust of freeware: it refers to the license type of the package.

3. Availability of statistical procedures:  are the most used statistical methods included in the package?

4. Management of huge data sets: how big data sets are handled.

Focusing on the first point, it is remarkable that programming with these three packages is quite easy. All of them provide the choice of writing syntax code and running it whenever you want. This also applies to anyone who might have to send you code, including managers or colleagues. Besides, IBM SPSS and R offer a friendly interface, giving the alternative of not learning the syntax at all. These could be the main reasons why people continue doing data analysis with the same statistical program for a long period, making for an easier researcher life!

Another important issue is whether the distribution is freeware or not. When starting to work as a biostatistician, one soon notices the high cost of statistical software licence fees. As you might know, IBM SPSS and SAS are commercial distributions and R is freeware. Although IBM SPSS and SAS are quite expensive, one should think about the benefits that both programs offer in the long term. For example, some researchers do not trust R, and think its use does not ensure getting correct results. In those cases, trust comes at a price. Regardless of what they think, R has grown rapidly in the past few years, offering real competition to the other packages.

The availability of the most used statistical procedures is another important point to be taken into account. All three have integrated into their environments the most popular statistical methods (regression analysis, correspondence analysis, etc.), but sometimes there is a need to implement more sophisticated procedures. As IBM SPSS and SAS are paid-for distributions, they do not let the user program these; they have fixed instructions. On the contrary, R can be understood as a combined and customizable collection of multiple small packages for solving problems. Whenever you have a problem, it is most likely that you will find a solution for it in the package repository.

Finally, as for the management of huge data sets, SAS is clearly the winner. SAS has a robust platform for handling big data sets, and it does not need to load everything into the computer’s memory for the analysis. In this sense, R still has a long way to go, even though its performance in this area has improved considerably in recent years. I would also say that IBM SPSS does not offer enough support here.

To sum up, I would like to give a couple of pieces of advice based on my own experience. I would recommend IBM SPSS for medical researchers who are starting to do statistics: it is quite easy and intuitive, and there is no need to learn the syntax. But not for a long period of time: learning basic R coding is not difficult and will be sufficient for the requirements of this type of researcher. It is only a small effort but really worth it! Nonetheless, for advanced researchers and biostatisticians, a combination of SAS and R would be perfect – that is what I do! I would say I am an “RSAS” user… Now, you have the choice: IBM SPSS, SAS or… R?

Featured

# …a scientific crowd

While researching scale-free networks, I found this book, which happens to include the very interesting article The Structure of Scientific Collaboration Networks, and which will serve me as a follow-up to my previous post on social networks here.

Collaborative efforts lie at the foundations of the daily work of biostatisticians. As such, the analysis of these relationships – and of the lack of interaction in some cases – appears to me as fascinating.

The article itself deals with the wider community of scientists, with connections understood in terms of paper co-authorship. The study seems to prove the high presence of small-world networks in the scientific community. However short the distance between pairs of scientists, I wonder how hard it is to cover that path, i.e., are we really willing to interact with colleagues outside our environment? Is the fear of stepping out of our comfort zone stopping us from pursuing new biostatistical challenges? Interestingly, one of Newman’s findings amongst researchers in the areas of physics, computer science, biology and medicine is that “two scientists are much more likely to have collaborated if they have a third common collaborator than are two scientists chosen at random from the community.”

Interaction patterns analyzed through social network diagrams like the one shown in Fig. 1 can give us a hint of these patterns of collaboration, but they can also be a means towards understanding the spread of information and research in the area (ironically, in a similar fashion to the spread of diseases, as explained here).

Fig.1. Biostatistics sociogram (illustration purposes only; R code adapted from here and here)

In my previous post on the topic, I focused on the great LinkedIn InMaps. This time I will be looking at Twitter, and at an example of the huge amount of information and the great opportunities for analysis that the platform provides. R, with its package twitteR, makes it even easier… After adapting the code from a really useful post (see here), I obtained data relating to Twitter users and the number of times they used certain hashtags (see plots in Fig. 2).

Fig.2. Frequency counts for #bio (top left), #statistics (top right), #biostatistics (bottom left), and #epidemiology (bottom right). Twitter account accessed on the 17th of May 2013.

Although not an exhaustive analysis, it is interesting to notice the lower figures for #biostatistics (turquoise) and #statistics (pink) compared to #bio (green) and #epidemiology (blue), for example (please notice the different scales on the y axis of the four plots). It makes me wonder whether activity in the field is our weak point, and whether the social media would be a fantastic way to promote our profession. I am certainly convinced of the great benefits a higher presence in the media would have, particularly in making the profession more attractive to the younger generations.

That was just a little peek at even more exciting analyses to come in future posts; meanwhile, see you on the media!

Do you make any use of social networks in your work? Any interesting findings? Can’t wait to hear them all!

Featured

# A computational tool for applying bayesian methods in simple situations

Luis Carlos Silva Ayçaguer. Senior researcher at the Escuela Nacional de Salud Pública in La Habana, Cuba; member of the development team of Epidat. Degree in Mathematics from Universidad de La Habana (1976), PhD from Universidad Carolina (Prague, 1982), Doctor of Science from Universidad de Ciencias Médicas (La Habana, 1999), Titular Academic of the República de Cuba.

Email: lcsilva@infomed.sld.cu

Soly Santiago Pérez. Technical statistician at the Dirección Xeral de Innovación e Xestión da Saúde Pública (General Directorate of Public Health, Spain) from 1996 to the present; member of the development team of Epidat. Degree in Mathematics from the Universidad de Santiago de Compostela (Spain, 1994). Specialist in Statistics and Operational Research.

In this post, we present a user-friendly tool for applying Bayesian methods in simple situations. This tool is part of Epidat, a freely distributed software package that has been developed by the Dirección Xeral de Innovación e Xestión da Saúde Pública (Xunta de Galicia, Spain) since the early 90s. The general purpose of Epidat is to provide an alternative to other statistical packages for data analysis; more specifically, it brings together a broad range of statistical and epidemiological techniques under a common interface. At present, the fourth version of Epidat, developed in Java, is freely available from the web page http://dxsp.sergas.es; registration is required to download the program.

As stated above, one of the methods or “modules” included in Epidat 4 is Bayesian analysis, a tool for the application of Bayesian techniques to basic problems, such as the estimation and comparison of means and proportions. The module provides a simple approach to Bayesian methods, not based on the hierarchical models that go beyond the scope of Epidat.

The module of Bayesian analysis is organized into several sub-modules with the following scheme:

• Bayes’ theorem
• Odds ratio
• Proportion
  • One population
    • Estimation of a proportion
    • Assessment of hypotheses
  • Two populations
    • Estimation of effects
    • Assessment of hypotheses
• Mean
  • One population
    • Estimation of a mean
    • Assessment of hypotheses
  • Two populations
    • Estimation of a difference
    • Assessment of hypotheses
• Bayesian approach to conventional tests

The first option of the module can be used to apply Bayes' theorem. The following three options (Odds ratio, Proportion and Mean) are designed to solve basic inferential problems under Bayesian logic: estimation of odds ratios, proportions and means, as well as differences or ratios of the last two parameters. The point estimate is accompanied by the corresponding credibility interval. The techniques available in these options also include methods related to hypothesis testing. Finally, the last sub-module allows the evaluation of conventional tests from a Bayesian perspective.
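To make this concrete, here is a small sketch in plain R (not Epidat itself) of the calculation behind the Bayesian estimation of a proportion: with a conjugate Beta a priori distribution the posterior is Beta as well, and qbeta() yields the credibility interval. The prior and data values are invented for illustration.

```r
a <- 1; b <- 1     # parameters of a uniform Beta(1, 1) a priori distribution
x <- 18; n <- 60   # observed successes and sample size

# Conjugacy: the posterior is Beta(a + successes, b + failures)
post_a <- a + x
post_b <- b + n - x

post_mean <- post_a / (post_a + post_b)              # point estimate, ~0.306
cred_int  <- qbeta(c(0.025, 0.975), post_a, post_b)  # 95% credibility interval
```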

Some specific options of Bayesian analysis require the use of simulation techniques. Nevertheless, the user does not need to understand the simulation process to use the program and interpret the results correctly. In addition, the module has a user-friendly interface that allows the user to plot the a priori distribution and choose the values of its parameters (see figure above). The output of Bayesian analysis includes both numerical and graphical results, and the graphics can be edited and modified by the user. In addition, the contents of the output window can be saved as a file in several formats: *.epi, *.pdf, *.odf, or *.rtf.

Finally, like all modules of Epidat, Bayesian analysis has a useful help feature which can be accessed on the help menu in pdf format. This facility has been developed with a didactic and critical approach, including statistical basics of the methods, bibliographic references, and examples of the different options.

Featured

# R2wd package: another tool to create reports

The last post published on FreshBiostats, by Hèctor Perpiñán, was about reports combining the use of R and LaTeX. In it, he explained how to transfer tables and figures from R output to LaTeX.

Well then, this week I shall continue with another tool for writing reports when we do our statistical analysis in R. It allows us to reduce the time spent drafting reports and to be more accurate by minimising human copying errors. I am going to focus my attention on the R2wd package, for writing MS Word documents from R.

Although most statisticians are used to working daily with LaTeX, which yields a high-quality appearance, in most cases our clients have never heard of it and prefer to work with MS Word documents, which are easier for them to handle.

The R2wd package needs either the statconnDCOM server (via the rcom package) or the RDCOMClient to communicate with MS-Word via the COM interface. Once this interface is installed (on Windows OS), we can create a document using all functions available in this package: wdGet, wdTitle, wdSection, wdType, wdSetFont, wdTable (the object must be a data frame or an array), wdPlot, wdApplyTheme, etc.

Not only do these functions allow us to introduce tables and figures from R output; they can also be used to inject text elements into the Word report, apply themes and templates, use different types of text (normal, italic, verbatim, …), etc.

The following code shows a little example on how to use this package:

### An easy example using the "R2wd" package ###

library("R2wd")  # of course, you must install it before!
# require("rcom")

wdGet()  # open a new Word document

# If wdGet() throws an error, you may need to install the DCOM server:
# installstatconnDCOM()

wdTitle("We introduce here our file title")
wdSection("First section begins here")

wdBody("Inserts text in 'Body' style at the current cursor point in Word.")
wdWrite("The function wdBody is similar to 'wdWrite'", paragraph = TRUE)
wdType("Now we use an italic type and centered text", italic = TRUE, alignment = "center")
wdNormal("To return to a Normal type")
wdWrite("We can also insert footnotes")
wdInsertFootnote("insert a footnote")  # inserts a footnote in the
# Word document at the cursor position

wdSection("Now we include another section")
wdSubsection("A toy example")

wdBody("Insert tables of results")

age    <- c(23, 12, 45, 34, 18, 41)
gender <- c(1, 1, 0, 1, 0, 0)
height <- c(172, 150, 169, 180, 188, 160)
data   <- data.frame(age, gender, height)

wdTable(data, autoformat = 2)  # three possible values for the autoformat parameter
wdItemize()
wdWrite("insert a text in the bullet", paragraph = TRUE)
wdItemize(Template = 3)
wdWrite("we finish the description", paragraph = TRUE)
wdBody("To finish this easy example we include a plot in our Word document")

wdPlot(data$age, data$height, col = "red", pch = 20, height = 10, width = 10, pointsize = 25)

wdSave()
wdQuit()  # close the session



As you have seen, besides being able to format our report as we want (although it may seem cumbersome to do from R), R2wd is really useful because with a single function, wdTable(), and one click I can copy a large table into my report with no mistakes.

You will find more functions here. Dare to try it! You will save plenty of time!

Featured

# Some tools to do a report (with LaTeX)

In another post I have already talked about different topics not directly related to Statistics. This time I will write about reports combining the use of R and LaTeX; specifically, about how to export tables to LaTeX and insert LaTeX typesetting into R figures.

Obviously our technical work is very important, but the final presentation should also be taken care of. Sometimes the problem is moving our results from the statistical software to the text editor, so in this post I will talk about my favourite tools for performing this task.

When looking for a text editor to write a report, there are many options to choose from (MS Word, LaTeX, knitr/Sweave, Markdown, etc.). I pick LaTeX over MS Word given how easily it handles mathematical language. The connection between knitr/Sweave or Markdown and R is straightforward (these tools embed R code); I will talk about them in future posts.

R tables to LaTeX

As I commented, I first work in R and then move the results to LaTeX. One of the simplest and most effective tools is the function xtable(), found in the package of the same name, xtable.

• xtable(): this function creates the LaTeX code for a table from an R object.

Other important packages in R with extra functionalities to convert object from R to Latex code are:

• Hmisc: you can use the function latex() to create a .tex file.
• texreg: highly configurable and designed to convert model summary outputs.
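As a quick sketch of the xtable() workflow (the lm model on the built-in swiss data is an invented example):

```r
library(xtable)

fit <- lm(Fertility ~ Education + Agriculture, data = swiss)

tab <- xtable(summary(fit)$coefficients,
              caption = "Linear model coefficients", digits = 3)
print(tab)                       # prints the LaTeX code to the console
# print(tab, file = "coefs.tex") # or write it to a file to \input later
```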

Figures

Many people will know about LaTeX graphics via TikZ (see the example pages), but what is less well known is that this type of graphics allows a wonderful connection between R and LaTeX: formulas, numbers, etc. are displayed in the format of your choice in LaTeX. To do this we simply have to follow these steps:

1. In R: create the TikZ image using the tikz() function (see this) in the package tikzDevice ( install.packages("tikzDevice", repos="http://R-Forge.R-project.org") ). This function creates a .tex file that could be compiled into a PDF on its own, but all you need to do is include the .tex file in your LaTeX document.
2. In LaTeX: insert the figures into the document with \input{normal.tex} and \input{symbol-regression.tex}. The R figure then directly uses your formatting in LaTeX. You need to include \usepackage{tikz} in the LaTeX preamble.
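The R side of step 1 can be sketched like this (the file name normal.tex matches the \input above; the plotted curve is an invented example):

```r
# install.packages("tikzDevice")   # CRAN; R-Forge was an earlier source
library(tikzDevice)

tikz("normal.tex", width = 4, height = 3)  # open a TikZ graphics device
curve(dnorm(x), from = -3, to = 3,
      xlab = "$x$", ylab = "$\\phi(x)$")   # axis labels typeset by LaTeX
dev.off()

# In the LaTeX document: \usepackage{tikz} in the preamble,
# then \input{normal.tex} where the figure should appear
```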

Will you tell us your “tricks” when making reports?

Featured

# Dealing with strings in R

As I mentioned in previous posts, I often have to work with Next Generation Sequencing data. This implies dealing with several variables that are text data or sequences of characters that might also contain spaces or numbers, e.g. gene names, functional categories or amino acid change annotations. In programming languages, this type of data is called a string.

Finding matches is one of the most common tasks involving strings. In doing so, it is sometimes necessary to format or recode this kind of variables, as well as search for patterns.

Some R functions I have found quite useful when handling this data include the following ones:

• colsplit() in the reshape package splits up a column based on a regular expression
• grepl() for subsetting based on string values that match a given pattern; here again we use regular expressions to describe the pattern

As you can see from the arguments of these functions, when manipulating strings it is useful to get comfortable with regular expressions. More information on regular expressions for building patterns can be found in this tutorial and in the regex R documentation.

Some other useful functions are included in the stringr package. As the title of the package says: “Make it easier to work with strings”:

• str_detect ( ) detects the presence or absence of a pattern in a string. It is based on the grepl function listed above
• fixed() looks for matches based on fixed characters, instead of regular expressions
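A few toy examples of these functions at work (the gene names and annotations are invented; the colsplit() arguments follow the reshape documentation and should be checked against ?colsplit):

```r
genes <- c("BRCA1", "BRCA2", "TP53", "MAPT")

# Base R: pattern matching and subsetting with regular expressions
grepl("^BRCA", genes)          # TRUE TRUE FALSE FALSE
genes[grepl("^BRCA", genes)]   # "BRCA1" "BRCA2"

# stringr: same idea, with fixed() switching off regex interpretation
library(stringr)
str_detect(genes, "^BRCA")
str_detect(genes, fixed("TP53"))

# reshape: split an annotation column into two new columns
library(reshape)
changes <- c("p.Val600Glu", "p.Arg175His")
colsplit(changes, split = "\\.", names = c("prefix", "aa_change"))
```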

Once again a Hadley Wickham package, joining reshape and plyr; all three contain a set of helpful features for handling data frames and lists.

This is just a brief summary of some options available in R. Any other tips on string handling?

Featured

# Interview with…Moisés Gómez Mateu

Moisés Gómez Mateu is a PhD student at the Universitat Politècnica de Catalunya (UPC), where he works as a research assistant.

Contact Moisés

1. Why do you like Biostatistics?

I have been very curious since I was a child; that's why I like statistics. Moreover, statistics applied to biology and medicine allows me to help people improve their quality of life.

2. Could you give us some insight in your current field of research?

My thesis focuses on survival analysis, especially on the issue of composite endpoints in clinical trials. The main aim is to analyse which primary endpoint is best to use, extend the statistical theory, and make practical tools available to researchers by means of an R library, an on-line platform, etc.

3. Did you find it difficult to move from the private sector to the University?

No. In fact, I left my job as a consultant in a marketing research company to study the MSc Statistics at the UPC, and I think it was a very good decision.

4. Which are, in your opinion, the main advantages of being a researcher?

It is very satisfying. The results you get and the research you conduct have nothing to do with the private sector. One usually investigates issues related to things you like, and to help improve science in general, not only to earn money.

5. What do you think of the situation of young biostatisticians in Spain?

The reality is that several colleagues and friends are working-studying abroad or looking for opportunities …

6. What would be the 3 main characteristics or skills you would use to describe a good biostatistician?

Curiosity, Analytical skills and Creativity.

7. Which do you think are the main qualities of a good mentor?

Expertise, Modesty and Open-mindedness.

Selected publications:

• Gómez G, Gómez-Mateu M, Dafni U. Informed Choice of Composite Endpoints in Cardiovascular Trials. Submitted.
• Gómez G, Gómez-Mateu M. The Asymptotic Relative Efficiency and the ratio of sample sizes when testing two different null hypotheses. Submitted.
Featured

# Hierarchical models: special structures within repeated measures models

Living creatures tend to organize their lives within structured communities such as families, schools, hospitals, towns or countries. For instance, students of the same age living in a town could be grouped into different classes according to their grade level, family income, school district and other features of interest. Other examples, related to health care workers and patients, show clear hierarchical data structures as well.

Hierarchical or nested structures, usually analyzed with hierarchical linear models (HLM), are very common throughout many research areas. The starting point for this kind of data pattern was in the social sciences. Most research studies in this area focused on educational data, where the main interest was to examine the relationship between inputs such as students, teachers or school resources, and student outcomes (academic achievement, self-concept, career aspirations…). Under this scenario, researchers emphasized that individuals drawn from the same institution (classroom, school, town, …) will be more homogeneous than subjects randomly sampled from a larger population: students belonging to the same classroom share the same environment (places, district, teachers, …) and experiences. Because of this, observations based on these individuals are not fully independent.

As noted, hierarchical models account for the dependency among observations within the same study unit. Until recent decades, owing to the lack of software development, ordinary least squares regression (OLSR), i.e. classical regression, was used to estimate the aforementioned relationships. As a consequence, results obtained from OLSR show standard errors that are too small, leading to a higher probability of rejecting the null hypothesis than if (1) an appropriate statistical analysis were performed, or (2) the data included truly independent observations. It is clear that the main issue researchers must address is the non-independence of the observations.

Hierarchical modeling is similar to OLSR. It can be seen as an extension of classical regression in which at least two levels are defined in the predictive model. At the base level (also called the individual level, referred to as level 1), the analysis is similar to OLSR: an outcome variable is defined as a linear combination of one or more independent level 1 variables:

$Y_{ij} = \beta_{0j}+\beta_{1j}X_{1} +\ldots+\beta_{kj}X_{k}+\epsilon_{ij}$

where $Y_{ij}$ is the value of the outcome variable for the $i$th individual of group $j$, $\beta_{0j}$ represents the intercept of group $j$, and $\beta_{1j}$ is the slope of variable $X_{1}$ in group $j$. At the subsequent levels, the level 1 slopes and intercept become dependent variables predicted from level 2 variables:

$\beta_{0j} = \delta_{00}+\delta_{01}W_{1} +\ldots+\delta_{0k}W_{k}+u_{0j}$
$\beta_{1j} = \delta_{10}+\delta_{11}W_{1} +\ldots+\delta_{1k}W_{k}+u_{1j}$

Through this process, it is possible to model the effects of both level 1 and level 2 variables on the desired outcome. In the figure of this post, one can observe three main levels: patients (level 1) belong to hospitals (level 2), which in turn are located in certain neighborhoods (level 3).

This kind of modeling is essential to account for individual- and group-level variation when estimating group-level regression coefficients. However, in certain cases the classical and HLM approaches coincide: (1) when there is very little group-level variation, and (2) when the number of groups is small and, consequently, there is not enough information to accurately estimate group-level variation. In these settings, HLM gains little over classical OLSR.
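As a sketch of how such a model is fitted in practice, here is a random-intercept example using the nlme package (distributed with R) and its MathAchieve data set of pupils nested in schools; the model and variables are purely illustrative:

```r
library(nlme)  # recommended package, distributed with R

data(MathAchieve)  # pupils (level 1) nested in schools (level 2)

# Random-intercept HLM: each school j gets its own intercept beta_{0j}
hlm <- lme(MathAch ~ SES, random = ~ 1 | School, data = MathAchieve)
summary(hlm)

# Classical OLSR for comparison: it ignores the school grouping and
# therefore tends to understate the standard errors
ols <- lm(MathAch ~ SES, data = MathAchieve)
summary(ols)
```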

Now it is your turn: you know when it is worth the effort of applying HLM methods instead of classical regression.

Featured

# A pinch of OR in Biostatistics

The first thing Operational Research (OR) and Biostatistics have in common is that both terms are often misunderstood. OR, or the Science of Better as it is commonly known, has lots to do with optimization, but there is much more to it… Some might be surprised to read that many of the tools of this discipline can be, and actually are, applied to problems in Biostatistics.

It all starts with Florence Nightingale volunteering as a nurse in the Crimean War in 1853… Her pioneering use of statistics for evidence-based medicine, and her operational research vision to reform nursing, led to a decline in the many preventable deaths that occurred throughout the nineteenth century in English military and civilian hospitals.

Since then, integrated approaches are commonplace: disease forecasting models, neural networks in Epidemiology, discrete-event simulation in clinical trials,…

Figure: Genetic algorithm representation using R. Find here the code for the original dynamical plot by FishyOperations.

There is also an increasing interest in OR applications to computational biology. Examples of these techniques range from linear programming based flux balance analysis models for cell metabolism studies, to sparse network estimation in the area of Genetics.

In R, there are many packages with which to apply these techniques, from genalg for genetic algorithms (used to recreate the figure above) or games for game theory, to more specific Bioconductor packages like CellNOptR, survcomp or OLIN. A general Task View on Optimization can also be found here.
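As a toy sketch of what genalg does, consider a "one-max" problem: find the binary string with the most 1s. Since rbga.bin() minimises its evalFunc, we return the negative count of 1s; the result fields used below are those listed in the genalg documentation and worth checking against ?rbga.bin:

```r
library(genalg)

# Fitness to minimise: the negative number of 1s in the chromosome
evaluate <- function(chromosome) -sum(chromosome)

ga <- rbga.bin(size = 20, popSize = 50, iters = 100,
               mutationChance = 0.01, evalFunc = evaluate)

min(ga$best)   # best objective value found; -20 means the all-ones string
plot(ga)       # evolution of best and mean fitness over the generations
```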

Finally, a quick mention of the YoungOR 18 Conference that will be held in Exeter (UK), 9-11 April 2013. It will cover topics of common interest for biostatisticians, with streams on Health, Sustainability, etc. Plenty of inspiration for future posts!

Have you ever used any of these techniques? Any particular tips you want to share? Tell us about it!

Featured

# Biostatistics in a Health Services Environment

Aurora Baluja is MD in the Department of Anesthesiology, Intensive Care and Pain Management of the Hospital Clinico Universitario of Santiago de Compostela, and PhD candidate at the Forensic Medicine Unit of the University of Santiago de Compostela (Spain).

As a medical doctor involved in patient management, in the operating room and in the ICU, every day I witness the huge load of data that has to be processed, interpreted, and stored. In this situation, Biostatistics and Bioinformatics support becomes greatly important to analyse the trends and patterns that ultimately lead to improved patient care.

I do research on risk profiles of mortality in the ICU and, given the amount of data obtained, the tasks of reporting my results and ensuring reproducibility became very time-consuming.

In this post I intend to comment on the main tools that helped me so much to overcome those obstacles:

• R: for me, the possibility of learning a statistical computing language to write and recycle scripts was definitely the way. R is a powerful, open source programming language and software environment that allows a full variety of analyses and… beautiful graphs.
• RStudio: finding a good IDE for R code was also important. Fortunately, Altea recommended this easy-to-use, visual piece of cake, with integrated options to report code and results… Thanks!
• R Markdown is a quick, easy method to write scripts and to report clean code and results, with some formatting options, in an HTML page. It is implemented in RStudio through the package “knitr”. Just one click away, you can generate a complete, dynamic report with text and graphics.
• LaTeX: the main reason for me to use LaTeX is not writing documents with beautiful symbols and equations, but to somehow link my R console to a document editor in order to generate tables and reports, just like I do with R Markdown. For this, I need to use RStudio again, and 4 more ingredients:
1. A LaTeX distribution: I have installed TeX Live for Linux, and MiKTeX for MS Windows. A variety of templates, often included in distributions, make the first approach quite easy. Beyond the use of Sweave (see below), I became so fond of LaTeX that now it is my favourite editor for documents, posters and slides!
2. Sweave is the link between R scripts and LaTeX, making it possible to write an entire LaTeX document with dynamic results from R commands embedded in it.
3. texreg: maybe the reason why I’ll never quit using LaTeX. Its magic begins after you have run several models on your data and you are trying to see and compare *all* of them at a glance. It generates LaTeX-formatted tables of your models, ready to paste into a LaTeX document. As a tip, I often use them preserving their {tabular} environments and customising the {table} options to fit my document style.
4. xtable: another R package that allows you to print nice tables in LaTeX format.
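A minimal sketch of point 3 (the two lm models on the built-in swiss data are invented for illustration):

```r
library(texreg)

m1 <- lm(Fertility ~ Education, data = swiss)
m2 <- lm(Fertility ~ Education + Agriculture, data = swiss)

# One LaTeX table comparing both models at a glance,
# ready to paste into (or write directly into a file for) a LaTeX document
texreg(list(m1, m2), custom.model.names = c("Base", "Extended"),
       caption = "Fertility models")
```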

I encourage those who haven’t used any of these tools to give them a try… surely they will help you!

Featured

# Measuring Disease Trends through Social Media

Nichole Knupp is a freelance blogger and marketing professional who writes guest posts and website content for a variety of different blogs and niches that interest her.

People in the field of Medicine, those in Master's in Public Health programs (see examples here and here) or in other disciplines such as Biostatistics (see previous post on the topic) might wonder how social media can play a valid role in their career.  While social media has taken the world by storm, it has mostly been thought of as a tool for sharing personal information, marketing, or keeping up on the latest gossip.  However, after this past flu season, those in the medical and scientific fields are finding a whole new way to use information that is trending.  Here are just a few of the things that social media is doing to benefit our health and sciences community.

Measuring Disease Trends
One of the terms used for watching and tracking the spread of illnesses through social media is “social media analytics” (see, for instance, social network analysis).  Quite simply, it is tracking outbreaks of various viruses by mining social media.  This past season, people were following the spread of influenza through a flu hashtag (#flu).  Viruses such as H1N1 (the swine flu) could be monitored on Twitter.  Further, it was found that when information was extracted efficiently, it was accurate, and there was even a possibility of forecasting further trends.  The most pertinent information, though, was about what was happening in the present.  Different agencies were able to find which areas were the hardest hit and who was at risk.

Public Interest and Concerns
The health and sciences community also found social media to be a great platform in finding out how interested the public was in disease outbreaks and trends.  It was also a way to gather concerns so various agencies could address them.  Social media is utilized by people as a way to have their voices heard through an outlet that is part of their daily lives.  While some people are unwilling to sit down and answer a survey, important information can be gathered by what is mentioned on daily Twitter streams.

Disease Outbreaks and Travel
Since social media is a tool that is being used worldwide, it has also been helpful in tracking outbreaks abroad.  US citizens, and others traveling, can easily be informed through public social media announcements.  Currently there are conversations happening on how best to use the information gathered on social media, and what role it will play in informing the public and addressing questions and concerns. (A very interesting article on the topic can be found here).

Social Media Sources
While Twitter has been the main source used for tracking trends, Instagram, Facebook, and Google Trends also play a part.  Some places to look, if you’re interested in seeing how it all works, are the HealthMap stream on Twitter (@healthmap), a Google Trends “flu” search, or a flu hashtag search (#flu) on Twitter.

It’s not hard to imagine the possibilities and the many other things we will be able to track in the future, but one thing is already certain: there is a use for social media if you work in the medical sciences, Public Health, Biostatistics,…

Featured

# Do we need Spatial Statistics? When?

Spatial Statistics: what is it, and why use it?

The approach taken in this post is to offer an introduction to the basic and main concepts of spatial data analysis, as well as to the importance of its use in areas like epidemiology or ecology, among others. But before introducing a definition of spatial statistics and some concepts from this field, I consider it relevant to mention some situations in which our data need to be seen as ‘spatial data’.

It is possible that people associate spatial statistics with analyses that contain numerous maps. However, it goes beyond creating these; in fact, spatial data analysis is driven by the internal structure of the observed data. We therefore have to be careful with questions not directly answered by looking at the data.

We could make a long list of areas where we can apply spatial statistics: epidemiology, agriculture, ecology, environmental science, geology, meteorology, oceanography,… even econometrics. In all of these, we could ask questions like the following to recognise whether our data have a spatial structure:

• Does the distribution of cases of a disease form a pattern in space? Could we relate health outcomes to geographic risk factors?
• Do environmental and geographical factors influence the distribution and variability of the occurrence of fishery species?

But how can we explain what spatial statistics is? I am sure we could find countless definitions of spatial analysis or spatial statistics. We can say that spatial statistics is responsible for analyzing the variability of random phenomena that are linked to their geographic locations. It is used to model the dependence of georeferenced observations (these can be point observations or areal data).

Depending on the type of data and the purpose of the spatial analysis itself, we classify spatial data sets into one of three basic types: geostatistical data, lattice data (or areal data) and point pattern data. The main dissimilarity between these resides in the set of observations. Let us look at this briefly with some examples:

• Geostatistics: geostatistical data are represented by a random vector $Y(s)$, where the location $s \in D$ varies continuously over a fixed observational region ($D \subset \Re^{r}$). These data are characterized by spatial dependence between locations, and the main objective in applications of geostatistics is to make predictions at unobserved locations of the study region. Kriging is the best-known technique for prediction with geostatistical data. Some examples of such data are the occurrence of a species in a region, annual acid rain deposition in a town, etc.

• Lattice data: the fixed subset ($D \subset \Re^{r}$) of observations is located in a continuous region (which can have a regular or irregular shape), partitioned into a finite number of geographical areas with well-defined boundaries. A characteristic of this type of spatial data is that neighbouring areas are usually more similar than distant ones. An example is the set of observed locations from agricultural field trials (here the plots form a regular lattice).
• Point pattern data: this is the last type of spatial data, in which $D \subset \Re^{r}$ is itself random. The locations themselves are generated by phenomena occurring at random places, thus producing spatial patterns. One example of point pattern data is the set of locations of a certain species of tree in a forest (here only the locations are thought of as random).
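As a sketch of the geostatistical workflow mentioned above, here is ordinary kriging with the gstat and sp packages and their classic meuse example data; the initial variogram parameters are rough guesses, not tuned values:

```r
library(sp)
library(gstat)

data(meuse)                     # zinc concentrations at 155 locations
coordinates(meuse) <- ~ x + y   # promote the data frame to a spatial object

# Empirical variogram of log(zinc) and a fitted spherical model
v       <- variogram(log(zinc) ~ 1, meuse)
v_model <- fit.variogram(v, vgm(psill = 1, "Sph", range = 900, nugget = 1))

# Ordinary kriging: prediction at the unobserved grid locations
data(meuse.grid)
coordinates(meuse.grid) <- ~ x + y
pred <- krige(log(zinc) ~ 1, meuse, meuse.grid, model = v_model)
spplot(pred["var1.pred"])       # map of the kriging predictions
```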

The above has been only a brief description of the three main types of spatial data. There are two basic methodologies for carrying out spatial analysis: classical statistics or the Bayesian approach (mainly using hierarchical Bayesian methods). The latter will be dealt with in more detail in future posts, given its importance and wide applicability, as in spatial statistics (see the book ‘Hierarchical Modeling and Analysis for Spatial Data’ by Banerjee et al. for more information).

Featured

# The complex structure of the longitudinal models

Two weeks ago, we started to talk in this blog about longitudinal data, with the post by Urko Agirre. This type of data involves models with a complex structure, called longitudinal models.

Longitudinal studies have two important characteristics:

1. They are multivariate because for each studied individual many temporal measurements of the response variable (and covariates) are collected.

2. They are multilevel as the variables measured are nested within the subjects under study, therefore resulting in layers.

These characteristics allow us to make inference about the general trend of the population as well as about the specific differences between subjects that can evolve in another way regarding the overall average behavior.

Modelling of this type of data began at the beginning of the 20th century. Different proposals appeared, such as ANOVA models (Fisher, 1918), MANOVA models (a multivariate generalisation of ANOVA) and growth curves (Grizzle and Allen, 1969). All these proposals improved on some aspects but left others unresolved. On the one hand, ANOVA models are univariate and our data are multivariate. On the other hand, MANOVA models are multivariate but assume independence between intra-subject observations (observations from the same individual are not, in general, independent). Finally, the last option, growth curves, does account for the dependence between intra-subject observations, but is too restrictive on the design matrix.

It was not until the early 80s when a proposal that included all the aspects of these complex data appeared. Laird and Ware proposed the application of linear mixed models (LMM) in the paper “Random-Effects Models for Longitudinal Data“.

The basic structure of these LMM for each patient $i$ is:

$y_i = X'_i\beta + Z'_ib_{i} + W_i(t_i) + \epsilon_i$

where

• $y_i=(y_{i1},\ldots,y_{in_i})$ is the vector of measurements of the response variable for the $i$th of a total of $m$ subjects, taken at times $t_i=(t_{i1},\ldots,t_{in_i})$; $n_i$ is the number of repeated measures of the $i$th patient.

• $X'_i\beta$ represents the deterministic part of the model, where $X'_i$ is the design submatrix of covariates associated with the $i$th individual and $\beta$ its associated parameter vector.

• $Z'_ib_i$ are random effects responsible for capturing the variability between individuals.

• $W_i(t_i)$ includes the intra-individual variability, i.e. the variability between observations of the same subject.

• Finally, $\epsilon_i$ reflects the variability that is not due to any systematic source we can determine.

To specify each of these elements, it is essential to first produce the descriptive graphics presented in this previous post.
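As an illustration of how such a model is fitted in practice, here is a minimal R sketch using the nlme package (which ships with R) and its built-in Orthodont data; neither is mentioned in the post, so treat this as an assumed example rather than the original analysis.

```r
library(nlme)  # ships with R

# Fit a Laird-Ware model to the built-in Orthodont data (repeated dental
# measurements on children): random intercept and slope per subject.
fit <- lme(distance ~ age, random = ~ age | Subject, data = Orthodont)

fixef(fit)  # the population-level trend, the X'beta part
ranef(fit)  # the subject-specific deviations, the Z'b_i part
```

The `random = ~ age | Subject` term gives each child its own intercept and slope, which corresponds to the $Z'_ib_i$ component above.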

We have learned a bit more about repeated measures, and in the next posts we will talk more about them, because this is only the beginning. To be continued!!!

Featured

# Mining genomic databases with R

When dealing with genomic data, retrieving information from databases is a required endeavor. Most of these repositories of genomic and proteomic data have very useful and friendly interfaces or browsers, perfectly suitable for particular searches. But if you are working with high-dimensional data, e.g. NGS data, more efficient tools to mine those databases are required. R offers several packages that make this task straightforward.

NCBI2R is an R package to annotate lists of SNPs, genes and microsatellites. It obtains information from NCBI databases. This package provides quite useful functions for getting a quick glance at a given set of SNPs or genes, retrieving SNPs in a gene or getting genes belonging to a KEGG pathway (among some other features) as shown in the code below.

library(NCBI2R)
GetIDs("MAPT[sym]")
GetGeneInfo(4137)
mysnps <- GetSNPsInGenes(4137)
GetSNPInfo(mysnps)
myset <- GetIDs("KEGG pathway:Alzheimer's disease")


The biomaRt package, part of the Bioconductor project, is an R interface to the well-known BioMart data management system, which provides access to several data sources including Ensembl, HGNC, InterPro, Reactome and HapMap.

Most analyses can be performed with just two functions: useMart() to define the dataset, and getBM() to perform the query. Queries are based on attributes – the values we are interested in retrieving –, filters – the restrictions on the query – and a given set of values for the filter. Once we have defined the dataset of interest, we can check all the available filters and attributes. Let's say we want to look for the genes associated with an OMIM phenotype.

source( "http://www.bioconductor.org/biocLite.R")
biocLite("biomaRt")
library(biomaRt)
listMarts()
mymart <- useMart("ensembl",dataset = "hsapiens_gene_ensembl")
listFilters(mymart)
listAttributes(mymart)
getBM(attributes = c("entrezgene", "hgnc_symbol", "mim_morbid_description"), filters ="entrezgene", values = myset, mart = mymart)


The org.Hs.eg.db package, also part of Bioconductor, is an organism-specific package that provides annotation for the human genome, based on mappings that use Entrez Gene identifiers. In a very similar way to biomaRt, it allows querying the databases by means of the select() function, specifying cols – the kinds of data that can be returned –, keytypes – which of the columns can be used as keys – and a given key.

source("http://bioconductor.org/biocLite.R")
biocLite("org.Hs.eg.db")
library(org.Hs.eg.db)
keytypes(org.Hs.eg.db)
cols(org.Hs.eg.db)
select(org.Hs.eg.db, keys= "MAPT", cols = c("SYMBOL", "CHR", "CHRLOC", "UNIPROT"), keytype = "SYMBOL")

These three packages offer, in many cases, a good alternative to other well-known data mining systems such as the UCSC Table Browser data retrieval tool. Besides, the commands from these packages are quite simple, so even those not that familiar with the R language can benefit from them.

Just a quick tip to finish – I have just found out about the fread() function (from the data.table package) for reading large files. It really makes a difference!

Featured

# From Descriptive to Repeated Measures Data….one small step for studies, one giant leap for (bio)statistics

Traditional epidemiological descriptive studies, also called cross-sectional studies, are characterized by reporting population health: they describe the existing distribution of the collected exposure factors (variables) without relating them to other hypotheses. In other words, they try to answer three basic “W” questions: who, where and when. The most important uses of this kind of research include health planning and hypothesis generation. Nonetheless, the most important pitfall is that researchers might draw causal inferences from this type of study, when the temporal associations between the exposures and the outcomes of interest might be unclear. Thus, when a researcher wants to verify a causal effect between two variables, a more appropriate design is highly recommended, such as a study with two or more observations per subject collected over the established research period. The latter design corresponds to a repeated measurement data structure and, more specifically, to longitudinal data analysis (a common form of repeated measures analysis in which measurements are recorded on individual subjects over time).

As mentioned in the previous paragraph, the main difference between the two designs, cross-sectional and longitudinal, is that each experimental unit participating in the former is observed only once, so for each exposure factor there is only one value per subject. In other words, each row in the dataset is an observation. In longitudinal data, however, each subject is observed more than once.

It is also worth pointing out the increase in the complexity of the statistical approaches when moving from descriptive analysis to repeated measures studies. In the first setting the statistical methods in use are the simplest ones: mean and percentage comparisons by means of classical tests, regression analysis, etc. In repeated measures data sets, however, and specifically in longitudinal data analysis, special statistical techniques are required for valid analysis and inference. Thus, researchers should be aware of three important points when building a proper statistical model, in this order: (1) the trend of the temporal component; (2) the variance-covariance structure; (3) the mean structure. More precisely, the overall trend of the evolution should be assessed first of all; temporal trends can follow a linear, quadratic, cubic or even a fourth-degree polynomial function. Besides, as observations on the same subject are likely to be correlated, repeated measures analyses must account for this correlation (the within- and between-subject effects must be controlled). Among the possible covariance structures, compound symmetry, unstructured and first-order autoregressive are the most commonly used. As for the mean structure, the potential exposure factors which could be related to the dependent variable should be included in the model.
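The covariance structures mentioned above can be compared directly in R. The following sketch uses nlme::gls with the built-in Orthodont data (an assumed example, not from the post), fitting compound symmetry and first-order autoregressive structures and comparing them by AIC:

```r
library(nlme)  # ships with R

# Two candidate within-subject covariance structures for the same mean model:
fit_cs  <- gls(distance ~ age, data = Orthodont,
               correlation = corCompSymm(form = ~ 1 | Subject))  # compound symmetry
fit_ar1 <- gls(distance ~ age, data = Orthodont,
               correlation = corAR1(form = ~ 1 | Subject))       # first-order AR

AIC(fit_cs, fit_ar1)  # the lower AIC points to the better-fitting structure
```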

Longitudinal studies play a key role, mostly in epidemiology and clinical research. They are used to determine the change in an outcome of measurement or to evaluate the effectiveness of a new treatment in a clinical trial, among other settings. Under these scenarios, and given the complexity of the statistical analyses, longitudinal studies involve a great deal of effort, but they offer several benefits. The most important ones, from my point of view, are the following: (1) the ability to measure change in outcomes and/or exposures at the individual level, so that the researcher has the opportunity to observe individual patterns of change; (2) the temporal order of the exposure factors and the outcomes is measured, so the timing of the outcome onset can be related to the covariates.

Finally, there is no single statistical package specific to this kind of analysis: nowadays, most packages include in their recent releases all the procedures needed to perform, at least, a basic longitudinal analysis. So there is no longer any excuse for not distinguishing a repeated/longitudinal analysis from a descriptive one, and for not carrying it out confidently…. Do you agree??

Featured

# Tricky bits: Ordinal data

From questionnaire-response studies to plant-flowering stages and drug-effect scoring analyses, researchers and biostatisticians often face situations where some of the variables are not continuous but instead present fixed ordered categories.

The straightforward approach would be to deal with these data taking into account their full nature. However, in practice, when it comes to the analysis, many professionals still take a continuous (interval-scale) approach in order to ensure that the most familiar statistical techniques can be employed.

Carifio and Perla claim that this debate on the use and “abuse” of the so-called Likert scales has been going on for over 50 years, and they point, as one of the main advantages of what they refer to as the “intervalist position”, to the easy access it gives to both traditional and more complex techniques built on interval scales.

Other authors, such as Jamieson and Kuzon Jr. et al., advocate tailored procedures to deal with this kind of variable, arguing that the ordered nature of the data needs to be reflected and that the distances between the different categories of a variable are hard to measure under a continuous approach.

Concerns regarding how to plot these data have often been raised too. From pie charts to divided bar charts (see the figure above and other examples here and here), it seems difficult to decide which form of visualisation is more appropriate and more easily understood.

Thankfully, the arrival of new graphical applications and user-friendly specialised software, like the R packages ordinal and MCMCglmm, is helping to bring consensus closer, so we can all soon be speaking the same “ordinal language”.
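For readers who want to try an ordinal model right away, here is a minimal sketch with MASS::polr (which ships with R) on simulated data; the post mentions the ordinal and MCMCglmm packages, so this is just an assumed, package-light alternative, with entirely hypothetical data.

```r
library(MASS)  # ships with R; provides polr()

# Simulate an ordered three-category response driven by a latent variable
set.seed(1)
x <- rnorm(200)
latent <- 0.8 * x + rnorm(200)
y <- cut(latent, breaks = c(-Inf, -0.5, 0.5, Inf),
         labels = c("low", "mid", "high"), ordered_result = TRUE)

# Proportional-odds (ordered logit) model
fit <- polr(y ~ x, Hess = TRUE)
summary(fit)  # slope for x plus the two category thresholds
```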

For or against? I look forward to reading your views on this!!

Note: A highly recommended book on the topic is Agresti, A. (2010). Analysis of Ordinal Categorical Data (2nd ed), Wiley.

Featured

# 2nd Biostatnet General Meeting Review

With a marked focus on young researchers in particular and health-related Biostatistics in general, the 2nd Biostatnet General Meeting, held in Santiago de Compostela (Spain) on the 25th and 26th of January, was a fantastic opportunity for the Network's members to gather and discuss common topics of concern as well as success stories.

FreshBiostats bloggers participated actively and now want to make our readers witnesses of this stimulating event.

7 of the 8 main researchers of Biostatnet

After the welcome and opening session chaired by Carmen Cadarso, with presentations on the past activities of the Network by Emilio Letón, David Conesa, Inmaculada Arostegui, and Jordi Ocaña, a busy programme was fitted into a day and a half of conference-like events:

Young researchers oral communications

Because of the high participation in the meeting, the oral communications by young researchers of Biostatnet were divided into three sessions:

• BIO session

The topics discussed in this first parallel session were the choice of primary end-points by means of a web application interface, by Moisés Gómez-Mateu; the modelling of non-proportional hazards regression, by Mar Rodríguez-Girondo; and randomization tests implemented in clinical trials, by Arkaitz Galbete. The second part of the session continued with two talks on Chronic Kidney Disease from two different approaches: the first one, from a survival analysis (competing risks) point of view, was presented by Laetitia Teixeira, and the second one, based on longitudinal analysis (Bayesian longitudinal models), was defended by Hèctor Perpiñán. Finally, Mónica López-Ratón presented her work on the estimation of generalized symmetry points for classification in continuous diagnostic tests. This session was moderated by Carles Serrat.

• STAT session

A varied arrangement of talks was framed within the STAT session, which featured the interesting view of Joan Valls on the experience of the biostatisticians working at IRBLleida, two applications of Structured Additive Regression (STAR) models by Elisa Duarte and Valeria Mamouridis, a comparative analysis of different models for the prediction of breast cancer risk by Arantzazu Arrospide, an optimal experimental design application presented by Elvira Delgado, and a simulation study on the performance of the Beta-Binomial SGoF multitesting method under dependence.

• NET session

In this third parallel session, topics such as “bio” research as well as others related to design of experiments were covered. Irantzu Barrio started with a talk on development and implementation of a methodology to select optimal cut-points to categorise continuous covariates in prediction models. Also in this session, Mercedes Rodríguez-Hernández presented her work on D-optimal designs for Adair models.

Also covering “bio” topics, a talk on derivative contrasts in quantile regression was given by Isabel Martínez-Silva. María Álvarez focused afterwards on the application of the method of maximum combination when comparing proportions. The two last communications dealt with the cost-effectiveness study of treatments for fracture prevention in postmenopausal women, by Nuria Pérez-Álvarez, and the application of Generalised Additive Mixed Models to the assessment of temporal variability of mussel recruitment, by María P. Pata.

Congratulations to the happy winners!!

To conclude these three sessions, Moisés Gómez-Mateu and Irantzu Barrio, the two winners of the ERCIM'12 Biostatnet invited sessions, received their awards (see picture above).

Posters sessions

Two poster sessions were also included in the hectic programme of the meeting, covering a wide range of topics, from the analysis of clinical and genetic factors (Aurora Baluja) to a collective blogging experience like ours (find it here).

As a courtesy to the young researchers participating in the meeting, Biostatnet´s main researchers gave each of us  The Cartoon Guide to Statistics, which definitely finds the fun side of Statistics (see the snapshot below for a nice example ;P). We are very grateful for this gift and promise to make good use of it, maybe by trying to convince those that are still skeptical about the enjoyable side of this amazing science!

Image extracted from “The Cartoon Guide to Statistics”

Roundtables

Throughout the meeting a total of 5 sessions of roundtables and colloquiums took place. Both professionals in the field of Biostatistics as well as young researchers participated and offered their views on different topics.

“Biostatisticians in biomedical institutions: a necessity?” was the first of the roundtables of the meeting, with the participation of Arantza Urkaregi, Llorenç Badiella, Vicente Lustres, and Erik Cobo. They addressed the question through their professional experiences, and the answer was unanimously positive.

The colloquium “Genomics, Biostatistics and Bioinformatics” was chaired by Malu Calle and featured presentations from Pilar Cacheiro (one of our bloggers), Roger Milne, Javier de las Rivas, and Álex Sánchez. They emphasized the importance of bringing together biostatistics and bioinformatics in the “omics” era, and a vibrant discussion followed regarding the definition of both terms.

The “Young Researchers roundtable” also generated a refreshing discussion about opportunities for young researchers in Biostatistics. Again, two of our bloggers, Altea Lorenzo and Hèctor Perpiñán, were involved in the session along with Núria Pérez and Oliver Valero, with Moisés Gómez as moderator and Isabel Martínez as organiser. The main conclusions reached at this table were the need for young biostatisticians to claim their important role in institutions, the aspiration to access specialised courses in the field, and the importance of communication, collaboration, and networking.

On the second morning, another very important topic, “Current training in Biostatistics”, was presented by three professors, Carmen Armero, Guadalupe Gómez and José Antonio Roldán, who currently teach in the Biostatistics master's and degree programmes offered by Spanish universities. Some interesting collective projects were outlined and will hopefully be implemented soon.

Plenary talk

We cannot forget the invited talk on “Past and Current Issues in Clinical Trials” by Urania Dafni, director of the Frontier Science Foundation-Hellas, professor of Biostatistics and director of the Laboratory of Biostatistics at the University of Athens' Nursing School of Health Sciences. This renowned professional gave an overall view of this hot topic and of the importance of the presence of biostatisticians in the whole process of drug design and development. The session and the following discussion were moderated by Montserrat Rué.

Closing colloquium

Finally, a session on the future of Biostatnet and the different alternatives for its development and improvement was chaired by Jesús López Fidalgo and María Durbán, with the collaboration of experts on international and national research project funding, Martín Cacheiro and Eva Fabeiro, and of Urania Dafni, director of the Greek node of the international network Frontier Science.

From what was shown, it seems like the year ahead is going to be a very busy and productive one for the Network and its members. All that we have left to say is… We are already looking forward to the 3rd General Meeting!!

LONG LIFE TO BIOSTATNET!!!

Your comments on the Meeting and this review are very welcome, let's keep the spirit up!

Featured

Natàlia Adell is a graduate in Statistics from the Universitat Politècnica de Catalunya. She also has a master's degree in Statistics and Operations Research from the same university. She worked at Kantar Media and in the Statistical Service of the Universitat Autònoma de Barcelona. At present, she works in the Statistical Assessment Unit of the Research Technical Services of the University of Girona.

+34 680778844

http://www.udg.edu/str/uae

1. Why do you like Biostatistics?

Because I like applied Statistics and if you can contribute to a good cause such as decreasing the number of illnesses, you will have all the right ingredients for good science.

2. Could you give us some insight in the work you develop at the Statistical Assessment Unit of the UdG´s Research Technical Services?

My main perception is that people need statisticians to help with a part of their research, studies… Statistics is a science that other scientists need and the Statistical Assessment Unit tries to provide it.

3. What were the main difficulties you found when setting up the unit?

The main difficulty was getting started. We had to organise the unit, establish all the procedures, and also let the community know about us. The most important thing I had was the support of all the people around me, who helped every time I needed it (and still do).

4. Is it possible to combine consultancy/advice and research?

Well, in our case, we dedicate ourselves just to the consultancy and giving advice because doing research is not the aim of the Statistical Assessment Unit. But it might be possible to combine both, because some doubts arise from research, and some questions need a research approach so they can be related.

5. What do you think of the situation of young biostatisticians in Spain?

I think  biostatisticians usually work alone, without the support of other statisticians and, in my opinion, it would be interesting to share knowledge with other biostatisticians. So I hope that BioStatNet and FreshBiostats will allow that!

6. What would be the 3 main characteristics or skills you would use to describe a good biostatistician?

Listening, communicating and having a deep knowledge of Statistics. If you have these three characteristics, you can be a good biostatistician.

7. What do you think are the main qualities of a good mentor?

I think the most important skill is to be organised, knowing the steps you need to take to achieve your goal. Being able to explain difficult techniques in a clear way is also appreciated.

8. Finally, is there any topic you would like to see covered in the blog?

Sample size could be a theme of interest!

Selected publications:

• Adell, N., Puig P., Rojas-Olivares, A., Caja, G., Carné, S. and Salama, A.A.K. A bivariate model for retinal image identification. Computers and Electronics in Agriculture. 2012; 87: 108-112. Epub 2012 June.
• M. A. Rojas-Olivares, G. Caja, S. Carné, A. A. K. Salama, N. Adell, and P.Puig. Determining the optimal age for recording the retinal vascular pattern image of lambs.  Journal of Animal Science. 2012; 90 (3): 1040-6. Epub 2011 Nov 7.
• Rojas-Olivares M.A., Caja G., Carné S., Salama A.A.K., Adell N., Puig P. Retinal image recognition for verifying the identity of fattening and replacement lambs. Journal of Animal Science. 2011; 89 (8): 2603-13. Epub 2011 Feb 4.
• Martínez-Vilalta J., López B.C., Adell N., Badiella L. and Ninyerola M. Twentieth century increase of Scots pine radial growth in NE Spain shows strong climate interactions. Global Change Biology. 2008; 14 (12): 2868-2881.
Featured

# Another important R: Relative Risk

María Álvarez Hernández, BSc in Mathematics (University of Salamanca), is a PhD student in Statistics and Operations Research at the University of Granada, where she works with Professor Martín Andrés. Her line of research is framed within the statistical analysis of categorical data from contingency tables. Contact María

One of the common objectives of Health Sciences is to compare the proportions of individuals with a feature of interest in two different populations, for which purpose it is usual to take two independent samples. This is the case when comparing the proportion of cures under two different treatments, or the proportion of patients in groups with and without a particular risk factor. In such situations, one parameter of interest is the difference between the two proportions, but in the field of Medicine the parameter of interest is usually their ratio, the relative risk (R). Examples include clinical trials that evaluate the effectiveness of a new vaccine, studies comparing two binary diagnostic methods, studies comparing two different treatments, etc.

From an exact point of view, obtaining a confidence interval for R is computationally very intensive: it requires specific computer programmes and is not feasible for moderately large sample sizes (Reiczigel et al., 2008). Hence researchers have devoted great attention to obtaining approximate confidence intervals and, although many different procedures have been proposed, they have not always been compared. Nowadays, there is a general consensus that the best procedure is the score method proposed by Koopman (1984) and by Miettinen and Nurminen (1985). Alternatively, other simpler methods have been proposed which work more or less well (Farrington and Manning, 1990; Dann and Koch, 2005; Zou and Donner, 2008).

One line of research in which I am involved aims to improve these methods and to suggest new ones that allow us to get closer to the exact result without losing rigour in the process (Martín and Álvarez, 2012). But although the improvement may be at the theoretical level, what happens on the computing scene?

From a practical point of view, obtaining confidence intervals for the relative risk with statistical packages such as SPSS 20, Stata 12 or StatXact 10 also focuses on the asymptotic case, although some of them allow the researcher to obtain the exact confidence interval (in some situations incurring a long computational time). In general, the methods used are based on the ideas of Miettinen and Nurminen (1985), where a standard normal distribution is assumed; of Katz et al. (1978), who applied the logarithmic transformation; and of Koopman (1984), with the reputed score method. Sometimes, as in the case of the StatXact software, the Berger and Boos correction can be applied because it reduces conservatism (resulting in shorter confidence intervals).
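To make the logarithmic-transformation approach concrete, here is a base-R sketch of the Katz et al. (1978) interval; the function name and the counts are hypothetical, and the formula is the standard log-RR one rather than any particular package's implementation.

```r
# Katz et al. (1978) log-transformation interval for the relative risk.
# x1/n1: events and sample size in group 1; x2/n2: same for group 2.
rr_katz <- function(x1, n1, x2, n2, conf = 0.95) {
  rr <- (x1 / n1) / (x2 / n2)                     # point estimate of R
  se <- sqrt(1 / x1 - 1 / n1 + 1 / x2 - 1 / n2)   # SE of log(R)
  z  <- qnorm(1 - (1 - conf) / 2)
  c(rr = rr, lower = rr * exp(-z * se), upper = rr * exp(z * se))
}

rr_katz(15, 100, 5, 100)  # hypothetical counts: R = 0.15 / 0.05 = 3
```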

The aim must be to obtain not only the best methods in a theoretical sense, but also those that are more feasible when we carry out the explicit calculations and that involve shorter computational times.

Therefore, although the theory evolves, the routines programmed in statistical packages to make inferences, for example about a measure of association like the relative risk, have not kept pace with other techniques, even though this is a priority for the Health sector.

In short, we should not be content with the procedures currently implemented, and we should spare no effort on research that allows us to improve them quickly and easily.

Featured

# Introduction to Bayesian statistics

After starting the year with a post about the International Year of Statistics, this week we present our first post about Bayesian statistics. This post has been written jointly by Hèctor Perpiñán and Silvia Lladosa.

Any researcher (particularly those working in the field of statistics) is aware of the two main approaches to this science: Frequentist or Classical statistics, and Bayesian statistics. The main difference between the Bayesian and Frequentist approaches is essentially a distinct interpretation of what probability means, and thus a different way of making inference.

The term Bayesian refers to Bayes theorem, which originates in the work of the Reverend Thomas Bayes (1702–1761), in particular his posthumous paper “An Essay towards solving a Problem in the Doctrine of Chances”, published in 1763 (note that this year marks its 250th anniversary).

In this post, we focus on introducing the basic aspects that characterize the Bayesian framework. First of all, we give a brief and simple statement of the principal idea of Bayesian statistics: it quantifies and combines all the uncertainty in the problem (data, parameters, etc.) in probabilistic terms. Probability is therefore understood as a degree of belief.

The basic procedure of Bayesian methodology involves:

• Assigning a prior probability distribution, $\pi(\theta)$, to the model parameters ($\theta$), which quantifies all the relevant information about them. This distribution must be chosen before seeing the data (it cannot, by any means, be conditioned on them).

Bayesian statistics has often been criticized because the interpretation of the prior distribution in terms of ‘beliefs’ seems subjective. But this is not the whole story: you can choose different kinds of priors, subjective ones (to be used when you have some information about the parameters) or objective ones (for situations where there is no information about them).

Although it has not been explicitly mentioned above, the fact that we can express our beliefs about a parameter by means of a probability density function is the result of treating the parameters as random variables. This is one of the biggest differences with respect to classical statistics, which treats parameters as fixed but unknown.

• Choosing a probabilistic model that relates the random variables and the model parameters associated with the experiment. This allows us to express the information provided by the data, given the parameters, in probabilistic terms by using the likelihood function, $p(y|\theta)$.

The last step in this procedure is to apply Bayes theorem to combine the prior knowledge and the new information into the posterior probability distribution, $\pi(\theta|y)$, of $\theta$,

$\pi(\theta|y)=\frac{p(y|\theta)\pi(\theta)}{\pi(y)}\propto p(y|\theta)\pi(\theta)$ .

The posterior distribution is updated according to the data, i.e. the prior probability is transformed, by the new evidence provided by the data, into the posterior. We can say that “today's posterior is tomorrow's prior”. This final distribution allows us to calculate point estimates of the parameters and credible interval estimates, to make predictions, etc.
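A tiny worked example of this updating, in base R, uses the conjugate beta-binomial model (the numbers are hypothetical): with a Beta(1, 1) (uniform) prior on a proportion $\theta$ and $y$ successes out of $n$ trials, the posterior is Beta(1 + y, 1 + n - y).

```r
# Conjugate beta-binomial updating with hypothetical data
a0 <- 1; b0 <- 1   # prior Beta(a0, b0): uniform on theta
y  <- 7; n  <- 10  # observed data: 7 successes in 10 trials

a1 <- a0 + y       # posterior parameters
b1 <- b0 + n - y

post_mean <- a1 / (a1 + b1)                  # posterior mean: 8/12
cred_int  <- qbeta(c(0.025, 0.975), a1, b1)  # 95% credible interval
```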

After this brief introduction to Bayesian methodology, we will continue in our next posts with: prior distributions, Bayesian hierarchical models, WinBUGS and much more.

We hope your fears are fading away and that you will start to use this powerful paradigm. Because, as a professor once told us: “Bayesian statistics is a way of life”.

Featured

# Happy New (International Statistics) Year!

With the start of the new year last Tuesday, it is now time to make resolutions and plans for the 12 months ahead. For scientists and especially for those who are involved in statistical matters, it will be a special one, since 2013 has been declared as the International Year of  Statistics by the American Statistical Association, the Institute of Mathematical Statistics, the International Biometric Society, the International Statistical Institute (and the Bernoulli Society), and the Royal Statistical Society. This year commemorates major events that were determinant for the evolution of Statistics. As mentioned on the International Statistics Institute website, “2013 will be the 300th anniversary of Ars Conjectandi, written by Jakob Bernoulli and considered a foundational work in probability…  and the 250th anniversary of the first public presentation of Thomas Bayes’ famous work.”

One of the initial aims of FreshBiostats is to promote, amongst young researchers and within our limits, this science that is a complete unknown to many people but at the same time plays a very important role in many other, more popular, fields, like Biology or Medicine in the particular case of Biostatistics.

It is with great pleasure that we find “nurturing Statistics as a profession, especially among young people” among the main objectives of Statistics2013, and it makes us very proud to be one of the participating groups supporting this and other equally important goals. Hopefully, this initiative will contribute to the growing trend that has been noticed in students' interest in the topic (you can find graphical representations of Harvard's stat concentration enrollment here).

For further information, you can visit the website http://www.statistics2013.org/ and watch the launch video here.

We hope to make a significant contribution to this fantastic year, how will you take part in the celebration? Time is running out!!

Featured

# Dose Finding Experiments

In a past entry, I wrote about some issues in clinical trial design, explaining the structure and the different phases of these trials. Now I am focusing on a part of them, the first two phases, which can also form a complete experiment by themselves: dose-finding experiments.

The aim of a dose-finding experiment is the safe and efficient administration of a drug in humans. When a new drug (or procedure) is under study, we want to determine a dose that is safe for application, but this dose should also be effective. A balance between these two goals, non-toxicity and efficacy, is required in clinical trials.

Ethical concerns are essential in these experiments, as in any experiment conducted on human beings, but especially so because these are first-in-man studies, so the safety of the participants is the main worry. These trials also have a very small sample size, usually about 20 patients, which is a problem for the statistical analysis.

In phase 1 trials, the target is the evaluation of the maximum tolerated dose (MTD), the highest dose level whose observed toxicity rate does not exceed a pre-established threshold. Depending on the risk of the experiment, a toxicity rate is fixed and the maximum dose level that does not exceed this rate is chosen. The recommended dose for the next phases of the study is then either the MTD or one dose level below it. In phase 2 trials, we have an analogous experiment, but now the target is the minimal effective dose (MED), the minimum dose level with a fixed efficacy rate. It is also common to try to combine these two targets in a single experiment, estimating a toxicity-efficacy curve and looking for an optimal dose that balances the two goals.

A wide catalogue of designs for dose-finding experiments can be found in the statistical literature. The initial dose, the dose escalation, the stopping rule and the accuracy of the estimation of the MTD and MED are the main concerns in a design, and they are still a fertile area for research. A classical phase 1 design is the traditional 3+3 design, widely used in oncology experiments. In this design, patients are assigned in groups of three and the trial starts at the lowest dose level. The first three patients are assigned and, if no toxicity is observed, the next patients are assigned to the next level; if there is one case of toxicity, we repeat the experiment at the same level; and if there are two or more toxicities at the same level, we conclude that we have exceeded the MTD. This procedure is repeated until the MTD is exceeded.
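The 3+3 rule described above can be sketched as a short simulation in base R; the function name and the true per-level toxicity probabilities are hypothetical, and the code follows the verbal description of the rule rather than any published implementation.

```r
# Minimal simulation of the traditional 3+3 dose-escalation rule
set.seed(42)
three_plus_three <- function(p_tox) {
  level <- 1
  while (level <= length(p_tox)) {
    tox <- rbinom(1, 3, p_tox[level])                      # first cohort of 3
    if (tox == 1) tox <- tox + rbinom(1, 3, p_tox[level])  # expand to 6 patients
    if (tox >= 2) return(level - 1)  # MTD exceeded: recommend previous level
    level <- level + 1               # escalate to the next dose level
  }
  length(p_tox)  # all levels tolerated
}

# Hypothetical true toxicity probabilities for four dose levels
three_plus_three(c(0.05, 0.10, 0.25, 0.50))
```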

Phases 1 and 2 have received much less theoretical treatment than phase 3 of clinical trials, but with this post I wanted to highlight their importance and to make them more understandable to people working in statistics.