Review of the 3rd Biostatnet General Meeting “Facing challenges in biostatistical research with international projection”


(Photo from Biostatech website)

Following on from successful meetings back in January 2011 and 2013,  on the 20-21 January 2017 Biostatnet members gathered again in Santiago to celebrate the network’s successes and discuss future challenges. These are our highlights:

Invited speakers

The plenary talk, “Why I Do Clinical Trials: Design and Statistical Analysis Plans for the INVESTED Trial”, introduced by Guadalupe Gómez Melis, was given by Prof. KyungMann Kim, from the Biostatistics and Medical Informatics department at the University of Wisconsin. Prof. Kim discussed challenges faced when conducting clinical trials to ensure follow-up of patients, and the technological and statistical conflicts in the endeavour to make large-scale clinical trials cost-efficient.


Prof. KyungMann Kim’s presenting his work

The invited talk by Miguel Ángel Martínez Beneito from FISABIO showcased solutions to issues with excess zeros modelling in disease mapping in a very enjoyable talk that generated a fascinating discussion.


Inmaculada Arostegui introducing Miguel Ángel Beneito’s talk


We really enjoyed the two great roundtables that were held at the meeting.

Young Researchers Roundtable

Firstly, we were delighted to be given the opportunity to organise a roundtable of young researchers at the event, and although we did not have much time, we managed to squeeze in four main topics of discussion including; Biostatnet research visits, reproducibility in research, professional visibility and diversity with a focus on women in Biostatistics (very fittingly just before the celebrations of the 1st International Day of Women and Girls in Science!). The topics proved to be of great interest and raised a lively discussion. Regarding visibility, issues such as how to properly manage a professional online profile and what the potential risks of too much exposure are, were raised in the discussion. Also mentioned was the fact that, while there is no doubt that accessing data and code for replicability and reproducibility purposes is highly important, researchers might lose sight of the conclusions due to a major focus on having full access to these resources.

Some other interesting issues were prompted that we could not expand on (we would have needed more time…!) so we would like to continue here…Feel free to send us your comments either here or via social media! or to answer this brief survey (post here). We are currently preparing a paper summarising the topics covered in the roundtable and we will let you know when it is ready!


Participants in the roundtable from right to left: Danilo Alvares (with contributions from Elena Lázaro), Miguel Branco, Marta Bofill, Irantzu Barrio and María José López.

BIO Scientific Associations Roundtable

Secondly, a very exciting and lively session gathering researchers with different backgrounds, all members of a variety of BIO associations and networks, gave us their impressions about what it means to work within a multidisciplinary team. In a very constructive and lively atmosphere, promoted by the moderator Erik Cobo (UPC), Juan de Dios Luna (Biostatnet), Vitor Rodrigues (CIMAGO), Mabel Loza (REDEFAR) and Marta Pavía (SCReN) discussed the pitfalls in communication between statisticians and researchers. We all really enjoyed this session, and took home some valuable messages to improve the interactions between (bio)statisticians, clinicians and applied researchers.


In addition to a satellite course on “CGAMLSS using R” (post to come on this topic!), we had the opportunity to attend two workshops on the last morning of the meeting.

Juan Manuel Rodriguez and Licesio Rodríguez delivered the Experimental Designs workshop. In a very funny and lively way (and using paper helicopters!!!!), they reviewed key concepts of Experimental Designs. This helpful workshop gave us a great opportunity to dust off our design toolkit.

In the software workshop moderated by Klaus Langohr, Inmaculada Arostegui and Guillermo Sánchez introduced two interactive web tools for predicting adverse events in bronchitis patients (PrEveCOPD) and biokinetic modelling (BiokmodWeb). Esteban Vegas showed a Shiny implementation of kernel PCA, and David Morina and Monica Lopez presented their packages radir and GsymPoint.

Oral communications

Although there were also presentations from less young researchers, there were plenty of sessions for younger members who have been getting a great support from the network (thank you, Biostatnet!). The jury decided that the best talks were “Beta-binomial mixed model for analyzing HRQoL data over time” by Josu Najera-Zuloaga. In his talk, he introduced an R-package, HRQoL, including the methodology to perform health-related quality of life scores in a longitudinal framework based on beta-binomial mixed models, as well as an application in patients with chronic obstructive pulmonary disease. The second awarded talk was “Modelling latent trends from spatio-temporally aggregated data using smooth composite link models. ” by Diego Ayma on the use of penalised composite link models with mixed model representation to estimate trends behind aggregations of data in order to improve spatio-temporal resolutions. He illustrated his work with the analysis of spatio-temporal data obtained in the Netherlands. Our own Natalia presented joint work with researchers from the Basque country node  on the application of Multiple Correspondence Analysis for the development of a new Attention-Deficit/Hyperactivity Disorder (ADHD) score and its application in the Imaging Genetics field. This work was possible thanks to one of Biostatnet grants for research visits.


Natalia’s presentation

Posters sessions

Three poster sessions were included in the meeting. As a novelty, these sessions were preceded by a brief presentation in which all participants introduced their posters in 1-2 minutes. Although imposing at first, we thought this was a good opportunity for everyone to at least hear about the work displayed, in case they missed the chance to either see the poster or talk to the author later on. The poster “An application of Bayesian Cormack-Jolly-Seber models for survival analysis in seabirds” by Blanca Sarzo showed her exciting work in a very instructive and visually effective way and won her a well-deserved award too.

Biostatnet sessions

Particularly relevant to Biostatnet members were also the talks by by three of the IPs, Carmen Cadarso, and Guadalupe Gómez and Maria Durbán (pictured below) highlighting achievements and future plans for the network. Exciting times ahead!


Last but not least, the meeting was a great opportunity for networking while surrounded by lovely Galician food and licor cafe ;p

We look forward to the 4th meeting!


Galician delicacies!

You can check the hashtag #biostatnet2017, tell us your highlights of the meeting here or send us your questions if you missed it…

(Acknowledgements: sessions pics by Moisés Gómez Mateu and Marta Bofill Roig)


Spatial objects in R (II)

In the last post I provided a simple introduction about spatial data and how to begin to manipulate them. Now, I would like to continue showing you some simple examples about what we could do with these data and the R tools we apply when working with them. Throughout this post I introduce you some links where R tools to process and learn about spatial data are available.

We continue using the same Spanish shapefile downloaded here.

Remember you need to load the library ‘maptools’ to read the shapefile.

#load the autonomous communities shapefile:
map1 <- readShapeSpatial('ESP_adm1.shp')

To plot this map you only need the ‘plot()’ function, but here I want to introduce a slightly different method of creating plots in R: the ‘ggplot2’ package. There is so much information about this package that you can find. For example, this web site http://ggplot2.org/ provides you with some useful books (R Graphics Cookbook (book website) and ggplot2: Elegant Graphics for Data Analysis (book website) where you can learn about this package. Also you can find presentations with examples and the corresponding R code.

‘ggplot()’ function requires some previous steps to prepare the spatial data: it needs to transform our SpatialPolygonsDataFrame into a standard  data frame. For this, ‘rgeos’ package must be installed.

map1.ggplot = fortify(map1, region="ID_1")  
#Now map1.ggplot is a data.frame object

‘fortity()’ function converts a generic R object into a data frame that needs the ggplot function. The ‘region’ argument determines  that the variable for each geographical area is identified and this should be common to both the map and the data.

A simple map is obtained with the following call to ggplot():

g <- ggplot(data = map1.ggplot, aes(x = long, y = lat, group = group)) +
     geom_polygon() +
     geom_path(color="white") + coord_equal()


If we want to add over the map a set of points:

#add points
locations <- read.csv('locations.csv',header=T)
g + geom_point(aes(long, lat,group=NULL), colour = "red", size = 0.3,  data = locations)
#locations should be a data frame object


If you want to test the code, you can generate points using the ‘locator()’ function and then create with them a data frame.

Finally, I would like to introduce the ‘over()’ function in the sp package to process spatial overlays and extractions. Let us suppose that we have a set of points over the region, and we only need the locations of one community. To use that function, we need to work with spatial objects.

# To create a SpatialPoints
coordinates(locations) <- c("long","lat")
# And to define the CRS:
proj4string(locations) <- CRS("+proj=utm +zone=30 +ellps=WGS84")</i></pre>

# Select the region of ‘Comunidad Valenciana’
map_Val <- map1[which(map1@data$NAME_1=="Comunidad Valenciana"),]
# CRS:
proj4string(map_Val) <- CRS("+proj=utm +zone=30 +ellps=WGS84")

pos<-which(!is.na(over(locations, geometry(map_Val))))
locnew <- locations[pos,]

pos stores the indices of the locations that are within the map_Val (discard the points that are out). After this point, you should follow the same steps to plot map_Val and the locnew points with the ggplot function.

I hope that beginners in spatial data find this information helpful!


Statistical or Clinical Significance… that is the question!

Most of the times, results coming from a research project – specifically in the health sciences field – use statistical significance to show differences or associations among groups in the variables of interest. Setting up the null hypothesis as no difference between groups and the alternative showing just the opposite –i.e, there is a relationship between the analyzed factors –, and after performing the required statistical method, a p-value is provided. This p-value indicates, under an established threshold of significance (say, Type I or alpha error), the strength of the evidence against the null hypothesis. If the p-value is lower than alpha, results lead to a statistically significant conclusion; otherwise, there is no statistical significance.

According to my personal and other biostatisticians’ experience in the medical area, most of physicians are only interested in the statistical significance of their main objectives. They only want to know whether the p-value is below alpha. But, the p-value, as noted in the previous paragraph, gives limited information: essentially, significance versus no significance and it does not show how important the result of the statistical analysis is. Besides from significance, confidence intervals (CI) and measures of effect sizes (i.e., the magnitude of the change) should be also included in the research findings, as they can provide more information regarding the magnitude of the relationship of the studied variables (e.g., changes after an intervention, differences between groups,…). For instance, CIs facilitate the range of values within the true difference value of the studied parameter lies.

In clinical research is not only important to assess the significance of the differences between the evaluated groups but also it is recommended, if possible, to measure how meaningful the outcome is (for instance, to evaluate the effectiveness and efficacy of an intervention). Statistical significance does not provide information about the effect size or the clinical relevance. Because of that, researchers often misinterpret statistically significance as clinical one. On one hand, a large sample size study may have a statistically significant result but a small effect size. Outcomes with small p-values are often misunderstood as having strong effect sizes. On the other hand, another misinterpretation is present when non statistical significant difference could lead to a large effect size but a small sample may not have enough power to reveal that effect.
Some methods to determine clinical relevance have been developed: Cohen’s effect size, the minimal important difference (MID) and so on. In this post I will show how to calculate Cohen’s effect size (ES) [1], which is the easiest one.

ES provides information regarding the magnitude of the association between variables as well as the size of the difference of the groups. To compute ES, two mean scores (one from each group) and the pooled standard deviation of the groups are needed. The mathematical expression is the following:

                                                         ES = \frac{\overline{X}_{G1}-\overline{X}_{G2}}{SD_{pooled}}

where X_{G1} = mean of the group G1; X_{G2} = mean of the group G2; and SD_{pooled} is the pooled standard deviation which follows the next formula:

                                                         SD_{pooled} = \sqrt{\frac{s^2_{1}n_{1}+s^2_{2}n_{2}}{n_{1}+n_{2}-2}}

being n_{1} = sample size for G1; n_{2} = sample size for G2; s_{1} = the standard deviation of G1; s_{2} = the standard deviation of G2;

But, how can it be interpreted? Firstly, it can be understood as an index of clinical relevance. The larger the effect size, the larger the difference between groups and the larger the clinical relevance of the results. As it is a quantitative value, ES can be described as small, medium and large effect size using the cut-off values of 0.2, 0.5 and 0.80.
Clinical relevance is commonly assessed as a result of an intervention. Nevertheless, it can be also extended to any other non experimental study design types, for instance, for cross-sectional studies.
To sum up, both significances (statistical and clinical) are not mutually exclusive but complementary in reporting results of clinical research. Researchers should abandon the only use of the p-value interpretation. Here you have a starting point for the evaluation of the clinical relevance.

[1] Cohen J. The concepts of power analysis. In: Cohen J. editor: Statistical power analysis for the behavioral sciences. Hillsdale, New Jersey: Academic Press, Inc: 1998. p. 1-17.


How to make an R package repository: an R-recipe

It is well known that R is one of the most used statistical package by the data scientists (researchers, technicians, etc..). This package allows flexibility to adapt already developed functions to their own interests. However, the problem comes when one needs to perform a complex algorithm that requires strong programming level or even to show to worldwide scientists how a new statistical technique works. When this scenario is present, it means that it is time to create your own package in R.

The most In this post we will learn how to make a simple R package only following several steps which are explained in the next paragraphs:

1. Update R version and other relevant tools.

First of all, if you are planning to develop a R package, it is highly recommended to update your R version to newest one. One must take into account that R versions are continuously updating so that your package would be likely to not to work properly. Furthermore, the package called Rtools is necessary to install in your system. This package has several add-ins (such as MinGW compiler) which are helpful for the development of the package.

2. Get to know your Operative System

For Windows System, you have to add to the PATH system variable the whole path where R is installed. For instance, if R is installed in c:, the PATH can be fixed as:



3. Decide the name and the kind of the package wanted to build.

This step is the most important. First of all, the package name can only be composed by numbers and letters and it is highly recommended to start it with a letter.

When building the functions belonging to the package, it is very important the arrangement of the files where the functions are written. There are two possible situations: (1) all functions in a file; (2) each function in its own file.  Both options are extreme since (1) writing all functions in a file could overload the system and (2) writing many functions in its own files implies having multiple files that should be packed. So, in order to search an alternative to these options, I suggest grouping related functions into the same files.

4. Edit some files

Once the previous steps are done, several files should be filled out/edited so that any R user who wants to use the functions within the package is able to understand what the function does.

First of all, it is obligatory to fill in  a file named DESCRIPTION. It is used to describe the package, with author, and license conditions in a structured text format that is readable by computers and by people.

Secondly, a tutorial of the package (called as VIGNETTES) should be written. An extended description with reproducible examples  shows the performance of each function within the repository.

5. Building, installing and executing the package

As final step, you have to mix all the components in a file by means of different software commands. Using the package.skeleton() function (see below) one can create the desired mypkg package:


where f, g, h, i¸ are all the functions belonging to the repository.

Once mypkg is created, the next step is to install it in our R system. You can use the common ways to perform it but I would advise using R commands to do that (R CMD INSTALL). Load the package and execute it. It is time to enjoy it!!

This has been a summary of the steps to follow for a R package repository development. Now it is your challenge!!!

Merry Christmas and Happy New Year!


Setting up your (Linux biostatistical) workstation from scratch


Facundo Muñoz, MSc in Mathematics, PhD in Statistics from University of Valencia. He is currently a postdoc researcher at the french Institute National de la Recherche Agronomique (INRA). His main research field is spatial (Bayesian) statistics applied to environmental and biostatistical problems. He is currently working on statistical methodologies for the analysis of forest genetic resources.

Being a busy biostatistician, I spend plenty of time glued to my computer. As an immediate consequence, every once in a while I need to set up my working station from zero. Either when I change job (I have done twice this year!), or when I want to update my OS version (upgrades rarely go perfect), or when I get a new laptop.

This involves, you know, installing the OS, the main programs I need for work like R and LaTeX, some related software like a good Control Version System, a couple of Integrated Development Environments for coding and writing, and a dozen of other ancillary tools that I use every now and then.

Furthermore, I need to configure everything to run smoothly, set up my preferences, install plugins, and so on.

Last time I did this manually, I spent a week setting everything up, and in the following days I always had something missing. Then I thought I should have got this process scripted.

Last week I set up my working environment in my new job. In a few hours I had everything up and running exactly the way I like. I spent an aditional day updating the script with new software, updated versions, and solving some pending issues.

I thought this script might be useful for others as well, hence this post. It is version-controled in a google code repository, where you can download the main script.

It is not very general, as installation details changes a lot from system to system. I use Linux Mint, but I believe it should go pretty straightforward with any derivative of Ubuntu, or Ubuntu itself (those distros using the APT package management). Other Linux branches (Arch, RedHat, Suse, Mac’s Darwin) users would need to make significant changes to the script, but still the outline might help. If you use Windows, well… don’t.

Of course, you will not be using the same software as I do, nor the same preferences or configurations. But it might serve as a guide to follow line by line, changing things to suit your needs.

In particular, it provides an almost-full installation (without unnecessary language packages) of the very latest LaTeX version (unlike that in the repos), and takes care of installing it correctly. It also sets up the CRAN repository and installs the latest version of R.

The script also installs the right GDAL and Proj4 libraries, which are important in case you work with maps in R or a GIS.

Finally, it installs some downloadable software like Kompozer (for web authoring), the Dropbox application, and more. It scrapes the web in order to fetch the latest and right versions of each program.

I hope it helps someone. And if you have alternative or complementary strategies, please share!


Interview with…Jorge Arribas


Jorge is a BSc Pharmacy from the UPV-EHU and he is currently a resident trainee in Microbiology and Parasitology at the H.C.U. Lozano Blesa in Zaragoza, working also on his PhD thesis.

Email: Jarribasg(at)salud(dot)aragon(dot)es

 1.     Could you give us some insight in your current field of research?

My PhD thesis focuses on the diagnosis of Hepatitis C Virus (HCV) infection by means of core antigen determination, and I am also collaborating in a line of research focusing on new treatments for H. pylori infection, funded by a FIS grant. The former, intends to perform a comparison with respect to the current technique in use for the diagnosis. The latter, analyses resistance to different Flavodoxin inhibitors.

2.     Where does Biostatistics fit in your daily work?

In both areas of research that I am working on, since they are essentially comparative studies against established techniques, and therefore require techniques to prove the significance – or lack of significance- of the improvements.

3.     Which are the techniques that you or researchers in your area use more often?

Statistical techniques such as sensitivity and specificity analysis, hypothesis testing (ANOVA, t-test). There is also a particular need in the area for techniques dealing with ordinal data.

4.     As a whole, do you find Biostatistics relevant for your profession?

A very important part of the speciality of Microbiology and Parasitology focuses on the research of new diagnostic methods, treatments, prevalence of antibiotic resistance, etc. Therefore, Biostatistics becomes extremely useful when comparing these novel approaches to previous ones.

5.     Finally, is there any topic you would like to see covered in the blog?

It would be great to see published some examples of statistical applications in my area of study.

Selected publications:

  • J. Arribas, R. Benito , J. Gil , M.J. Gude , S. Algarate , R. Cebollada , M. Gonzalez-Dominguez , A. Garrido, F. Peiró , A. Belles , M.C. Rubio (2013). Detección del antígeno del core del VHC en el cribado de pacientes en el programa de hemodiálisis. Proceedings of the XVII Congreso de la Sociedad Española de Enfermedades Infeccionas y Microbiología Clínica (SEIMC).
  • M. González-Domínguez, C. Seral, C. Potel, Y. Sáenz, J. Arribas, L. Constenla, M. Álvarez, C. Torres, F.J. Castillo (2013). Genotypic and phenotypic characterisation of methicillin-resistant Staphylococcus aureus (MRSA) clones with high-level mupirocin resistance in a university hospital. Proceedings of the 23nd European Congress of Clinical Microbiology and Infectious Diseases and 28th International Congress of Chemotherapy.
  • M. González-Domínguez, R. Benito, J. Gil,  MJ. Gude, J. Arribas, R. Cebollada, A. Garrido, MC. Rubio (2012).  Screening of Trypanosoma cruzi infection with a chemiluminiscent microparticle immunoassay in a  Spanish University Hospital. Proceedings of the 22nd European Congress of Clinical Microbiology and Infectious Diseases and 27th International Congress of Chemotherapy.