Please help us by filling in the following survey:
https://es.surveymonkey.com/r/76BW2XT
We look forward to your views!
Please help us by filling in the following survey:
https://es.surveymonkey.com/r/76BW2XT
We look forward to your views!
Pilar Cacheiro & Altea Lorenzo
After reading the news that PLOSGenetics is to actively solicit manuscripts from pre-print servers (PPS; read here) as a way to “improve the efficiency and accessibility of science communication”, we decided to write a quick overview on some of the most popular repositories.
Arxiv is probably the most widely known PPS as it has been available since 1991 and it mainly covers publications from the fields of mathematics, physics and computer science. Although a later development (2013), popularity of the PPS for biology bioRxiv is rapidly increasing, particularly in the fields of genomics and bioinformatics (see post here). SocARxiv for the social sciences is even more recent (July 2016), so we still have to wait to see how it is received by the community.
Some other repositories extend this feature to incorporate additional information. On Figshare, for instance, researchers can freely share their research outputs, including figures, data sets, images, and videos. GitHub, although mainly focused on source-code, also offers similar utilities (read previous posts here and here)
The main advantage of these PPS is the speed at which you make your work available to the scientific world therefore maximising the impact and outreach. Additionally, they allow for suggestions and comments from peers which make the process a more interactive one.
At a time when research consortiums are starting to require submission of manuscripts to online PPS ahead of peer review ( 4D Nucleome being a prominent recent example as reported on Nature news), and even governmental agencies (e.g., National Institutes of Health in US; read here) are enquiring about the possibility of allowing preprints to be cited in grant applications and reports, pre-prints are bound to play a big role in scientific research dissemination.
Related tools such as OpenCitations that provides information on downloads or citations of these pre-prints and Wikidata that serves as open data storage, are other examples of resources framed within the Creative Commons Public Domain philosophy of free open tools, and that will surely have a positive impact on efforts towards guaranteeing reproducibility and replicability of scientific research (read about a recent paper reproducibility hack event here).
We are keen to give it a try, are you?
Letting the world know about your work is important. This is true for any field of research, but for those of us working in any kind of computing-related field, it becomes even more relevant.
Professional networks, distribution lists, blogs and forums are an excellent way to keep up to date with research in your field (you can check a previous post on this topic here). Even social networks can be a great way to do so, depending on how you decide to use them. For instance, you can find quite a number of PostDoc positions posted on twitter.
There is another important aspect to consider as well. By being an active part of the community, not only do you gain visibility, but you also get the chance to contribute in return… How many issues have you solved through StackOverflow, Biostars, R-bloggers posts and so on? How many articles were you able to get through ResearchGate?
Back to the matter at hand, I have just recently found out about how to create a personal website with GitHub . If, like me, you do not know much about HTML programming, you might want to use the automatic page generator. I can tell you: if you have already figured out the content you want on it, and providing you already have a GitHub account, it will take you less than 5 minutes, (see a previous post on how to create an account here). Starting from a simple static web -simple is beautiful-, it can evolve into more sophisticated sites.
A set of online video tutorials recorded during the Cycle of Conferences “Bioestadística para Vivir” can be found here. These talks are aimed to the general public and try to show the impact of Biostatistics on everyday life, especially in the fields of health and the environment.
Biostatnet is one of the collaborating members of this project,“ Bioestadística para Vivir y Cómo Vivir de la Estadística”, which is promoted by the University of Santiago de Compostela, through its Unit of Biostatistics (GRIDECMB), and the Galician Institute of Statistics. This knowledge exchange initiative is funded by the Spanish Science & Technology Foundation (FECYT).
Other activities are being developed under this project; the exhibition Exploristica – Adventures in Statistics: an itinerant exhibition teaching Statistics to secondary school students, Cycles of Conferences, Problem solving workshops, etc. You can find out more here .
Hi again!
As active members of the National Biostatistics Network Biostatnet, we would like to invite everyone to participate in the Biostatnet Workshop on Biomedical (Big) Data that will take place the 26th and 27th of November 2015 in Centre de Recerca Matemàtica (Barcelona).
The aim of this workshop is to provide a meeting point for senior and junior researchers, interested in statistical methodologies, analysis and computational problems arising from biomedical problems with large and complex data sets.
Contributed oral and poster presentations on biostatistical methods, in general, and with a special emphasis on methods, problems and solutions for large data sets from the biomedical field, are welcome and the deadline for abstracts submission is October the 25^{th}, so hurry up if you don’t want to miss this great opportunity to show your research!
Aedin Culhane (Harvard University), Rodrigo Dienstmann (Vall d’Hebron Institute of Oncology), Jonathan Marchini (Oxford University), Giovanni Montana (King’s College London) and Simon Wood (University of Bath) will be the invited speakers.
Further details are available on the website of the workshop:
http://www.crm.cat/en/Activities/Curs_2015-2016/Pages/Biostatnet2015.aspx
Enjoy!
Natàlia Vilor-Tejedor holds a BSc in Mathematics and Applied Statistics from the Universitat Autònoma de Barcelona. Additionally, she holds a MSc in Omics Data Analysis from the Universitat de Vic where she also worked at the Bioinformatics and Medical Statistics Research group developing new methods for analyzing the individual evidence of SNPs in biological pathways and was involved in some GWAS data analyses. Currently, she is a PhD student in the Biomedicine programme at the Universitat Pompeu Fabra. At the same time, she is working at the Barcelona Biomedical Research Park (in the Centre for Research in Environmental Epidemiology) where she is working on new statistical methods for integrating different types of Omics, Neuroimaging and Clinical Data from the European BREATHE project. She received one of the two Biostatnet prizes for the best presentations at the JEDE III conference for her talk on “Efficient and Powerful Testing for Gene Set Analysis in Genome-Wide Association Studies”.
Contact info: nvilor(at)creal(dot)cat
Because I enjoy my work and I think that my research can help to improve different aspects of biomedicine and, in general, the population’s life quality.
Now I’m just starting my PhD thesis focused on the development of new mathematical methods to better understand the commonalities between Genetics and Neuroimaging, and how both affect Environmental phenotypes.
This is a relatively “new” area that has not been examined in depth, so I’m sure statistics can play an important role.
The most difficult part is to find a good mentor who is an expert in both areas and who is willing to help you. After that, you need to become familiar and understand different biological concepts that probably you haven’t come across before. Then, you have to extrapolate these concepts to a mathematical point of view, and finally (what is the most important and hardest part), you have to be able convey the information to both, geneticists and mathematicians.
All of these steps require an extra effort that is often difficult to overcome.
I think the most important quality is knowing how to impart the indispensable knowledge whilst being clear and well organised, but also it is important to instill confidence and provide development opportunities.
Although we are at a very delicate social-economic moment, I think we have a talented and very well prepared generation of young biostatisticians supported by important national institutions such as the Spanish Biometric Society and the Societat Catalana d’Estadística (in my case) and national networks such as BioStatNet and online resources like this blog that are helping us in our training.
I think this is a very interesting, motivating and emerging field of research where a lot of statisticians and mathematicians are required. From my personal experience, the vast majority of mathematicians/statisticians are not comfortable with biology, but I think that promoting interdisciplinarity is essential to ensure biomedical research.
Of course, even more genetics!
Selected publication and projects:
The fast-growing technological change and the need to innovate quickly has yielded collaboration requirements among groups coming from different places of the world. For instance, in large-scale worldwide scientific research projects, it is important to share all the information and keep it updated. The know-how relies on cloud-based systems, not only to coordinate and manage the information, but also to handle the computational efforts at different strategic peaks during the process of the project. “Ancient” resources such as offline-working cannot maintain this kind of development.
Google Drive and Dropbox are both currently the bestl-known cloud file storage services. The use of these platforms has heavily increased but, in the scientific field, another file repository has stepped forward: GitHub. This file hosting platform provides several collaboration features such as task management and features requests for every project, among others. It has become the main platform of the open-source development community that many scientists have begun considering it for a conventional file repository.
At first glance, it seems difficult to use – maybe it is the first time you have heard about GitHub, or never had the chance to try it!- but once you start using it, you will not replace it by any other file storage. Following carefully the basic steps, you will master it:
– Steps needed before getting started
1. Download the Git bash freeware application. Using few command lines, this application helps you to download/upload files from your working directory located in your desktop to the corresponding directory in the cloud.
2. If you do not have a github account (for example, username: freshbiostats), create it here. This site will be your working place in the cloud. The https addres of your site will be (for example): https://github.com/freshbiostats.
– Steps needed during the working process
3. Once you have a github profile, it is highly recommended to create a new repository (i.e., new folder) for each project you are working on. Choosing an adequate name (for example: fresh) describing the content of it, your files will be uploaded there so that you can share all the information with other people working on the same work assignment. These repositories are public by default, but you have the choice by making them private.
4. In your computer, make a directory, for example, named fresh. Using Git bash and via cd command, go to the created folder. For simplicity, this directory will be called from now as master. After it, type
git init
This initializes a git repository in your project. It means that you have entered in the github world and you are ready to work with this system.
5. It is time to create files and upload them to the cloud. In case there are no files in the master folder, and you have to create a document to upload to the web-storage system. To do that:
git status this command lists out the content of the folder
In case the master directory is empty, and you want to create a file “README.doc”, with a sentence “Hello World”:
touch README.doc “Hello World”
Now, once the file is created, the next step is to upload to the web. To this end, execute:
git add README.doc it stages the document to the directory
git commit -m “Readme document” it adds a label to the document.
git remote add origin https://github.com/freshbiostats/fresh.git
The last command line assigns the destination folder where the file is to be uploaded. It creates a remote connection, named “origin”, pointing at the GitHub repository just created.
6. After that, typing:
git push -u origin master
With this command line, you send all the files added to the master (called as fresh) to the repository located in the web (https://github.com/freshbiostats/fresh.git). AND…THAT’S ALL! The file is already in the cloud!.
Now, it is your turn, it is worth a try: GitHub, commit and push… !!
Prior to our presentation at the JEDE III conference on “assessment of impact metrics and social network analysis”, we’d like to get a flavour of the research dissemination that’s going on out there so we can catalogue some of the resources used by researchers in Biostatistics. As Pilar mentioned in previous posts (read here and here), we’re active users of online tools such as PubMed and Stackoverflow, and although we’re relatively new to aggregators such as ResearchBlogging, we really believe in their value as disseminators in the current Biostatistics arena.
Still, we feel we must be missing many others…So we’re interested both in what you use to advance your research but also how you promote it.
We’d be very grateful for your answers to the following questions (we’ll show a summary of the responses at the end of the month):
We welcome your further comments on this topic below and look forward to reading your answers.
Hopefully see you in Pamplona!
In the last post I provided a simple introduction about spatial data and how to begin to manipulate them. Now, I would like to continue showing you some simple examples about what we could do with these data and the R tools we apply when working with them. Throughout this post I introduce you some links where R tools to process and learn about spatial data are available.
We continue using the same Spanish shapefile downloaded here.
Remember you need to load the library ‘maptools’ to read the shapefile.
#load the autonomous communities shapefile: library(maptools) map1 <- readShapeSpatial('ESP_adm1.shp')
To plot this map you only need the ‘plot()’ function, but here I want to introduce a slightly different method of creating plots in R: the ‘ggplot2’ package. There is so much information about this package that you can find. For example, this web site http://ggplot2.org/ provides you with some useful books (R Graphics Cookbook (book website) and ggplot2: Elegant Graphics for Data Analysis (book website) where you can learn about this package. Also you can find presentations with examples and the corresponding R code.
‘ggplot()’ function requires some previous steps to prepare the spatial data: it needs to transform our SpatialPolygonsDataFrame into a standard data frame. For this, ‘rgeos’ package must be installed.
library(ggplot2) library(rgeos) map1.ggplot = fortify(map1, region="ID_1") #Now map1.ggplot is a data.frame object
‘fortity()’ function converts a generic R object into a data frame that needs the ggplot function. The ‘region’ argument determines that the variable for each geographical area is identified and this should be common to both the map and the data.
A simple map is obtained with the following call to ggplot():
g <- ggplot(data = map1.ggplot, aes(x = long, y = lat, group = group)) + geom_polygon() + geom_path(color="white") + coord_equal()
If we want to add over the map a set of points:
#add points locations <- read.csv('locations.csv',header=T) g + geom_point(aes(long, lat,group=NULL), colour = "red", size = 0.3, data = locations) #locations should be a data frame object
If you want to test the code, you can generate points using the ‘locator()’ function and then create with them a data frame.
Finally, I would like to introduce the ‘over()’ function in the sp package to process spatial overlays and extractions. Let us suppose that we have a set of points over the region, and we only need the locations of one community. To use that function, we need to work with spatial objects.
# To create a SpatialPoints coordinates(locations) <- c("long","lat") # And to define the CRS: proj4string(locations) <- CRS("+proj=utm +zone=30 +ellps=WGS84")</i></pre> # Select the region of ‘Comunidad Valenciana’ map_Val <- map1[which(map1@data$NAME_1=="Comunidad Valenciana"),] # CRS: proj4string(map_Val) <- CRS("+proj=utm +zone=30 +ellps=WGS84") pos<-which(!is.na(over(locations, geometry(map_Val)))) locnew <- locations[pos,]
pos stores the indices of the locations that are within the map_Val (discard the points that are out). After this point, you should follow the same steps to plot map_Val and the locnew points with the ggplot function.
I hope that beginners in spatial data find this information helpful!
In this blog we usually talk about applications of statistics in different fields of knowledge or about the software we use in it. However, this week and after the Easter holidays, I would like to address the issue of the dedication of the researchers.
Just a few weeks ago we could read at international press news that France will limit the access to work calls and emails outside of the workplace therefore cutting down the effective time of daily work to consultants and engineers as an effort to ultimately increase their productivity by having some proper rest. And I asked myself, do researchers take these breaks?
During these years of PhD, I have had many conversations with my colleagues about periods of stress and tools that could help us. Linking back to the news of our French neighbours, limiting working hours can certainly help those who feel that their workday never ends, either by lack of organization or excessive workload. This limitation would result in increased availability of time to perform other activities to relieve the mind and would allow us to return to work the following days rested and motivated, and to be, in the long run, more productive.
From my point of view, based on my experience as a researcher, the issue is more complicated in this case due to the general absence of a working day schedule. We must be able to juggle our sometimes “necessary” labour flexibility with performing other daily tasks. Here’s a rare case where flexibility, which is a priori evaluated positively, can hinder the organisation of our life.
My answer for some time now is to try to implement the GTD methodology in my life, proposed by David Allen on the book “Getting Things Done: The Art of Stress-Free Productivity”. GTD basically means to unify the management of my professional, personal, fun life etc..
The main lesson to take from GTD in research is to be clear about what goals we pursue so we can clearly address them with direct actions, in order to avoid devoting too much time to tasks that do not really help us move forward (professionally or personally).
Are you in this situation too? Share with us your tricks and solutions to avoid stress through better government of time!
I am currently taken an edX course, which is becoming one of my favorite online platforms, along with coursera. This time it is “Data Analysis for Genomics”, conducted by Rafael Irizarry, you might know him by his indispensable Simply Statistics blog. The course has just started, but so far the feeling is really good. It has been through this course that I have found about a new package by the great Hadley Wickham: dplyr. It is meant to work exclusively with data frames and provides some improvements over his previous plyr package.
As I spend quite a bit of time working with this kind of data, and having written a post some time ago about how to handle multiple data frames with the plyr package, I find it fair to update on this one. Besides I think it is extremely useful, and I have already incorporated some of its functions to my daily routine.
This package is meant to work with data frames and not with vectors like the base functions. Its functionalities might replace the ddply function in plyr package (one thing to mention is that is not possible to work with lists yet –as far as I know-). Four functions: filter( ) – instead of subset- , arrange( ) –instead of sort -, select( ) –equivalent to using select argument in subset function-, and mutate( ) – instead of transform- are, in my opinion, reason enough to move to this package. You can check here some examples on using these functions. The syntax is clearly improved and the code gets much neater, no doubt about it.
Two other essential functions are group_by( ) that allows to group the data frame by one or several variables and summarise( ) for calculating summary statistics on grouped data.
You can find general information about the package at the rstudio blog or several other blogs talking about its goodness, here or here.
Not to mention its other great advantage, the speed, not a minor issue for those of us who work with extremely large data frames. Several speed tests have been perfomed (here and here),( and it seems to clearly outperform the speed of plyr or data.table packages***).
I am so glad to have found it… I hope you will be too!
***Additional speed tests results will be published soon since this statement might be wrong.
Most of the times, results coming from a research project – specifically in the health sciences field – use statistical significance to show differences or associations among groups in the variables of interest. Setting up the null hypothesis as no difference between groups and the alternative showing just the opposite –i.e, there is a relationship between the analyzed factors –, and after performing the required statistical method, a p-value is provided. This p-value indicates, under an established threshold of significance (say, Type I or alpha error), the strength of the evidence against the null hypothesis. If the p-value is lower than alpha, results lead to a statistically significant conclusion; otherwise, there is no statistical significance.
According to my personal and other biostatisticians’ experience in the medical area, most of physicians are only interested in the statistical significance of their main objectives. They only want to know whether the p-value is below alpha. But, the p-value, as noted in the previous paragraph, gives limited information: essentially, significance versus no significance and it does not show how important the result of the statistical analysis is. Besides from significance, confidence intervals (CI) and measures of effect sizes (i.e., the magnitude of the change) should be also included in the research findings, as they can provide more information regarding the magnitude of the relationship of the studied variables (e.g., changes after an intervention, differences between groups,…). For instance, CIs facilitate the range of values within the true difference value of the studied parameter lies.
In clinical research is not only important to assess the significance of the differences between the evaluated groups but also it is recommended, if possible, to measure how meaningful the outcome is (for instance, to evaluate the effectiveness and efficacy of an intervention). Statistical significance does not provide information about the effect size or the clinical relevance. Because of that, researchers often misinterpret statistically significance as clinical one. On one hand, a large sample size study may have a statistically significant result but a small effect size. Outcomes with small p-values are often misunderstood as having strong effect sizes. On the other hand, another misinterpretation is present when non statistical significant difference could lead to a large effect size but a small sample may not have enough power to reveal that effect.
Some methods to determine clinical relevance have been developed: Cohen’s effect size, the minimal important difference (MID) and so on. In this post I will show how to calculate Cohen’s effect size (ES) [1], which is the easiest one.
ES provides information regarding the magnitude of the association between variables as well as the size of the difference of the groups. To compute ES, two mean scores (one from each group) and the pooled standard deviation of the groups are needed. The mathematical expression is the following:
where = mean of the group G1; = mean of the group G2; and is the pooled standard deviation which follows the next formula:
being = sample size for G1; = sample size for G2; = the standard deviation of G1; = the standard deviation of G2;
But, how can it be interpreted? Firstly, it can be understood as an index of clinical relevance. The larger the effect size, the larger the difference between groups and the larger the clinical relevance of the results. As it is a quantitative value, ES can be described as small, medium and large effect size using the cut-off values of 0.2, 0.5 and 0.80.
Clinical relevance is commonly assessed as a result of an intervention. Nevertheless, it can be also extended to any other non experimental study design types, for instance, for cross-sectional studies.
To sum up, both significances (statistical and clinical) are not mutually exclusive but complementary in reporting results of clinical research. Researchers should abandon the only use of the p-value interpretation. Here you have a starting point for the evaluation of the clinical relevance.
[1] Cohen J. The concepts of power analysis. In: Cohen J. editor: Statistical power analysis for the behavioral sciences. Hillsdale, New Jersey: Academic Press, Inc: 1998. p. 1-17.
Arantzazu Arrospide
Arantzazu Arrospide Elgarresta studied mathematics in the University of the Basque Country (UPV/EHU) and works as a biostatistician in the Research Unit of Integrated Health Organisations in Gipuzkoa. This research unit gives support to four regional hospitals (about 100 beds each one) and all the public Primary Care Health Services in Gipuzkoa.
Email: arantzazu.arrospideelgarresta@osakidetza.net
and
Irantzu Barrio
Acting teacher at the Department of Applied Mathematics, Statistics and Operational Research of the University of the Basque Country (UPV/EHU)
Email: irantzu.barrio@ehu.es
Both young biostatisticians are currently working on several ongoing research projects. They belong to the Health Services Research on Chronic Patients Network (REDISSEC) – among others biostatisticians – and tell us what they think about Biostatistics.
1. Why do you like Biostatistics?
Irantzu Barrio: On one hand I like applying statistics to real problems, data sets and experiments. On the other hand, I like developing methodology which can contribute to get better results and conclusions in each research project. In addition, I feel lucky to work in multidisciplinary teams. This allows me to learn a lot from other areas and constantly improve on mine own, always looking for ways to provide solutions to other researchers needs.
Arantzazu Arrospide: I think Biostatistics is the link between mathematics and the real world, giving us the opportunity to feel part of advances in scientific research.
2. Could you give us some insight in your current field of research?
AA: Our main research line is the application of mathematical modeling the evaluation of public health interventions, especially economic evaluations. Although Markov Chain models are the most common methods for this kind of evaluations we work with discrete event simulation models which permit more flexible and complex modeling.
IB: I’m currently working on my PhD thesis. One of the main objectives of this work is to propose and validate a methodology to categorize continuous predictor variables in clinical prediction model framework. Specifically we have worked on logistic regression models and Survival Models.
3. You have been doing an internship abroad. What was the aim of your stay?
IB: I did an internship in Guimaraes at the University of Minho, Portugal. During my stay, I worked together with Luis Filipe Meira Machado and María Xosé Rodriguez-Alvarez. The aim was to learn more about survival models and extend the methodology developed so far, considering different prediction models.
AA: I did a short stay in the Public Health department of the Erasmus Medical Centre in Rotterdam (Netherlands) last November. The aim of the visit was to discuss the validation of a discrete event simulation model developed to estimate the health effects and costs of the breast cancer screening program in the Basque Country.
4. What did allow you to do that was has not been possible in Spain?
IB: Oh! It’s amazing when you realize you have all your time to work on your research project, one and a unique priority for more than two months. Of course, all the other to do’s did not disappeared from my calendar, only were postponed until my return to Bilbao. And, in addition to that, it was also a privilege to work together with high experienced biostatisticians and to have the opportunity to learn a lot from them.
AA: The research group I visited, internationally known as the MISCAN group, is the only European member of the Cancer Intervention and Surveillance Modeling Network (CISNET) created by the National Cancer Institute in the United States. Their main objective is to include modeling to improve the understanding of the impact of cancer control interventions on population trends in incidence and mortality. These models then can project future trends and help determine optimal control strategies. Currently, Spanish screening programs evaluation is mainly based on the quality indicators recommended by the European Screening Guidelines which do not include a comparison with an hypothetical or estimated control group.
5. Which are the most valuable aspects to highlight during your internship? What aspects do you believe that might be improved?
IB: I would say that my internship was simply perfect. When I came back to Bilbao I just thought time had gone really really fast. I’m just looking forward to go back again.
AA: This group works for and in collaboration with their institutions. They are the main responsible of evaluation of ongoing screening programs, prospective evaluation of screening strategies and leaders for new randomized trials in this topic. This is the reference group in the Netherlands for cancer screening interventions and their institutions consider their conclusions when making important decisions.
6. What do you think of the situation of young biostatisticians in Spain?
AA: When you work in a multidisciplinary research group both methodological and disease specific knowledge are essential and it takes a long time to achieve it. Institutional support is necessary to obtain long term funds that would ensure future benefits in healthcare research based on rigorous and innovative methods.
IB: I think the situations for young biostatisticians and for young people in general is not easy right now. And at least for what I see around me, there is lot of work to do for.
7. What would be the 3 main characteristics or skills you would use to describe a good biostatistician? And the main qualities for a good mentor?
AA: Open minded, perfectionist and enthusiastic. As for the mentor, he/she should be strict, committed and patient.
IB: In my opinion good skills on statistics, probability and mathematics are needed. But at the same time I think it is important to be able to communicate with other researchers such as clinicians, biologists, etc, specially to understand which are their research objectives and be able to translate bio-problems to stat-problems.
For me it is very important to have good feeling and confidence with your mentor. I think that having that, everything else is much easier. On the other hand, if I had to highlight some qualities, I would say that a good mentor would: 1) Contribute with suggestions and ideas 2) Supervise the work done and 3) be a good motivator.
8. Finally, is there any topic you would like to see covered in the blog?
IB: I think the blog is fantastic, there is nothing I missed in it. I would like to congratulate all the organizing team, you are doing such a good job!!! Congratulations!!!
AA: Although it is not considered part of statistical science operational research methods also can be of interest in our researches.
Selected publications (6):
Arrospide, A., C. Forne, M. Rue, N. Tora, J. Mar, and M. Bare. “An Assessment of Existing Models for Individualized Breast Cancer Risk Estimation in a Screening Program in Spain.”. BMC Cancer 13 (2013).
Barrio, I., Arostegui, I., & Quintana, J. M. (2013). Use of generalised additive models to categorise continuous variables in clinical prediction. BMC medical research methodology, 13(1), 83.
Vidal, S., González, N., Barrio, I., Rivas-Ruiz, F., Baré, M., Blasco, J. A., … & Investigación en Resultados y Servicios Sanitarios (IRYSS) COPD Group. (2013). Predictors of hospital admission in exacerbations of chronic obstructive pulmonary disease. The International Journal of Tuberculosis and Lung Disease, 17(12), 1632-1637.
Quintana, J. M., Esteban, C., Barrio, I., Garcia-Gutierrez, S., Gonzalez, N., Arostegui, I., Vidal, S. (2011). The IRYSS-COPD appropriateness study: objectives, methodology, and description of the prospective cohort. BMC health services research, 11(1), 322.
Mar, J., A. Arrospide, and M. Comas. “Budget Impact Analysis of Thrombolysis for Stroke in Spain: A Discrete Event Simulation Model.”. Value Health 13, no. 1 (2010): 69-76.
Rue, M., M. Carles, E. Vilaprinyo, R. Pla, M. Martinez-Alonso, C. Forne, A. Roso, and A. Arrospide. “How to Optimize Population Screening Programs for Breast Cancer Using Mathematical Models.”.
As in many other research areas, mixed models have become widely applied in sports science and related health issues. Within sports science, different branches account for the interests of different stakeholders; general managers, coaches, teams, supporters, scientists, academics. Human performance and medicine (treatment and prevention of injuries) lie behind them all and these models are a way to account for within subject variability, time-dependent covariates, and nested data.
On the competition side, efforts have been made in the literature to find ways to model player performance. Casals and Martínez (2013) approach this problem in the context of the NBA basketball league results by considering a balanced design where player is the random intercept and the forward stepwise approach to model selection by Pinheiro and Bates (2000) has been adopted to determine additional factors to be included in the final model. The great advantage of their proposed models for points and win scores over previous studies is that they account for the variation in players´ characteristics and therefore can predict potential variation in player performance using Linear Mixed Models (LMM) and Generalized Linear Mixed models (GLMM), and are consequently of great help for coaches, managers and other decision makers.
A similar methodology has been followed to predict scoring ability in the English football Premier League by McHale and Szczepański (2014). You may recall the post by Hèctor Perpiñán on calculating results probabilities via simulation. While in Perpiñán´s study only teams´ results were considered for the calculations, McHale and Szczepański´s mixed modelling approach allows for the inclusion of players´ ability to create and convert shots as a disaggregated factor from chance. The accuracy in the prediction of their models in this paper shows again the relevance of models that allow the inclusion of players´ characteristics (rather than just teams´).
Of particular note to our readers is the trend towards the implementation of mixed models in open source software exemplified in the aforementioned papers which use R (R Core Team, 2012) for their modelling, in particular packages nlme and lme4.
In community exercises for promoting physical activity like the one described in Okely et al. (2011), one research aim has been to determine policies and interventions that help to encourage exercising during adolescence in order to alleviate risk factors in the long run. This particular school-based randomised controlled trial used mixed models to account for the hierarchical structure of the data (32 schools from four geographical regions). According to the authors, one of the greatest advantages of the methodology is that it “incorporates all available data allowing for the analysis of partial datasets created when a participant drops out of the study or misses a study visit.”
Moving on to health applications, injury prediction, for instance in baseball pitchers and athletes, can be modelled by deploying similar approaches to determine the existence of statistically significant differences between groups and within days post-injury for example.
Finally, in veterinary epidemiology mixed modelling has become equally mainstream, as discussed in a recent article by Stryhn and Christensen (2014). As an example of an application, animal risk factors associated with health disorders in sport can also be modelled via these techniques. Visser et al. (2013) have studied the effect in race horses in the Netherlands via a cross-sectional study and have considered a random farm effect. Commercial software has been used in these two previous examples. – SAS (PROC MIXED) and GenStat (GLMM)- which are again of common use in the application of mixed models.
As stated by Casals and Martínez (2013), phenomena like Moneyball have raised the profile of sports data analysis. For researchers, big data and more widely, the intrinsic complex structure of the data coming from the aforementioned fields –and I would add software availability- have stimulated application of these types of models and they seem to be here to stay…
These are some of the examples that we have found but we will sure be missing some other very interesting ones so please let us know…Are you a researcher in the area and would like to tell us about your experience? Have you used this sort of model in this or other areas and are you willing to share your experience?
Main references:
Casals, M., & Martinez, J.A. (2013). Modelling player performance in basketball through mixed models Journal of Performance Analysis in Sport, 13 (1), 64-82
McHale, I., & Szczepański, L. (2014). A mixed effects model for identifying goal scoring ability of footballers Journal of the Royal Statistical Society: Series A (Statistics in Society), 177 (2), 397-417 DOI: 10.1111/rssa.12015
Okely, A., Cotton, W., Lubans, D., Morgan, P., Puglisi, L., Miller, J., Wright, J., Batterham, M., Peralta, L., & Perry, J. (2011). A school-based intervention to promote physical activity among adolescent girls: Rationale, design, and baseline data from the Girls in Sport group randomised controlled trial BMC Public Health, 11 (1) DOI: 10.1186/1471-2458-11-658
Visser, E., Neijenhuis, F., de Graaf-Roelfsema, E., Wesselink, H., de Boer, J., van Wijhe-Kiezebrink, M., Engel, B., & van Reenen, C. (2013). Risk factors associated with health disorders in sport and leisure horses in the Netherlands Journal of Animal Science, 92 (2), 844-855 DOI: 10.2527/jas.2013-6692
Manuel G. Bedia is an assistant professor in the Department of Computer Science at the University of Zaragoza. He is one of the founders of the Spanish Network of Cognitive Science (retecog.net). This network has been established to promote and coordinate research in Cognitive Systems with goals overlapping those of the European Network EUCognition but with more emphasis on the relationships between scientific and educational policies, and the Spanish university system. He holds a BSc in Physics, a MSc.in Technological Innovation management and a Ph.D. in Computer Science and Artificial Intelligence (Best PhD Thesis Award, 2004), all from the University of Salamanca (Spain). He has worked as a Technological Consultant in Innovation and knowledge management (Foundation COTEC, Madrid, Spain) and as a research fellow in the field of artificial cognitive systems in the Department of Computer Science at the University of Salamanca, the Planning and Learning Group at the University Carlos III of Madrid (Visiting Professor, 2005-07) and the Multidisciplinary Institute at the University Complutense of Madrid (2004-07). He has also been a visiting postdoctoral researcher at the Institute of Perception, Action and Behavior (University of Edinburgh, 2005) and the Centre for Computational Neuroscience and Robotics at the University of Sussex, 2008.
1. Your area of research is Cognitive Sciences. Could you give us a brief introduction about the focus of your work?
Cognitive science is a space for interdisciplinary research where we aim to understand how the mind works. It joins together neuroscientists, psychologists, philosophers, engineers and of course statisticians too!
During the past five decades, analogies between the human mind/brain and computer software/hardware have led the work of researchers trying to understand how we think, reason and solve problems.
However, over the last few years, new conceptions have arisen doubting this conceptualisation. The biggest influence behind this change in perspective has come from engineers rather than scientists; in particular a group of engineers using the disciplinary tools of engineering to generate new scientific hypotheses instead of applying knowledge generated from other areas.
In a reversal of the usual role of engineers using models for the development of artifacts, the process develops tools to think about mind phenomena.
2. Could you give us an example of this?
Imagine we purposefully build a very simple artifact or software program that is capable of performing a certain task in a novel way. This proves the existence of explanatory alternatives to phenomena that were supposed to work in a certain way. In the words of other authors, the models serve as “mental gymnastics”. They are entities equivalent to classical mental experiments: They are artifacts that help our thinking. These tools are the foundations of modelling exercises: dynamic systems, probability theory, etc.
3. Is probability an important tool in your work?
It is indeed very important and relevant at many levels of the research in this area.
At a fundamental level, the mathematical languages that the early Artificial Intelligence (AI) researchers developed were not sufficiently flexible (they were based on the use of logic and rule systems) to capture an important characteristic of our intelligence: its flexibility to interactively reorganise itself. This led to a growing interest in tools that would embrace this uncertainty.
Recently a very interesting approach has been developed in the area where fundamental principles are based on probability: Artificial General Intelligence (AGI). The original goal of the AI field was the construction of “thinking machines” – that is, computer systems with human-like general intelligence. Due to the difficulty of this task, for the last few decades, the majority of AI researchers have focused on what has been called “narrow AI” – the production of AI systems displaying intelligence regarding specific, highly constrained tasks. In recent years, however, more and more researchers have reapplied themselves to the original goals of the field recognising the necessity and emergent feasibility of treating intelligence holistically. AGI research differs from the ordinary AI research by stressing the versatility and entirety of intelligence. Essentially, its main objective was to develop a theory of Artiﬁcial Intelligence based on Algorithmic Probability (further explanations can be found here).
At a more concrete level, there are several examples. For instance, it is well known that the reasoning model of the clinical environment is fundamentally Bayesian. The clinicians analyse and reflect on previous conditions and status of patients, before reaching a diagnosis of their current condition. This fits very well with the whole idea of Bayesian probability. Following the same line of reasoning, probability appears as a fundamental tool to model artificial minds thinking as humans.
In general, this Bayesian framework is the most used in our field.
4. How can this be applied in your area of research?
The Bayesian framework for probabilistic inference provides a general approach to understanding how problems of induction can be solved in principle, and perhaps how they might be solved in the human mind. Bayesian models have addressed animal learning , human inductive learning and generalisation, visual perception, motor control, semantic memory , language processing and acquisition , social cognition, etc.
However, I believe that the most important use comes from the area of neuroscience.
5. So what is the neuroscientific viewpoint in the field of the understanding of our mental functions, the Cognitive Sciences?
Neuroscience intends to understand the brain from the neural correlates that are activated when an individual performs an action. The advances in this area over the years are impressive but this conceptual point of view is not without problems. For instance, as Alva Noë states in his famous book Out of Our Heads, the laboratory conditions under which the measurements are taken substantially affect the observed task…This is a sort of second order cybernetics effect as defined by Margaret Mead decades ago. The history of neuroscience also includes some errors in the statistical analysis and inference phases…
6. Could you explain this further?
In the early 90s, David Poeppel, when researching the neurophysiological foundations of speech perception, found out that none of the six best studies of the topic matched his methodological apparatus (read more here).
Apparently, these issues were solved when functional magnetic resonance imaging (fMRI) emerged. As this technique was affordable it allowed more groups to work on the topic and indirectly forced the analytical methods to become more standardised across the different labs.
However, these images brought in a new problem. In an article in Duped magazine Margaret Talbot described how the single inclusion of fMRI images in papers had arguably increased the probability of these being accepted.
7. You have also mentioned that big mistakes have been identified in the statistical analysis of data in the area. What is the most common error in your opinion?
In 2011 an eye-opening paper was published on this topic (find it here). The authors focused their research on the misreported significance of differences of significance.
Let’s assume one effect is statistically significantly different from controls (i.e. p<0.05), while another is not (p>0.05). On the surface, this sounds reasonable, but it is flawed because it doesn’t say anything about how different the two effects are from one another. To do this, researchers need to separately test for a significant interaction between the two results in question. Nieuwenhuis and his co-workers summed up the solution concisely: ‘…researchers need to report the statistical significance of their difference rather than the difference between their significance levels.’
The authors had the impression that this type of error was widespread in the neuroscience community. To test this idea, they went hunting for ‘difference of significance’ errors in a set of very prestigious neuroscience articles.
The authors analysed 513 papers in cognitive neurosciences in the five journals of highest impact (Science, Nature, Nature Neuroscience, Neuron and The Journal of Neuroscience). Out of the 157 papers that could have made the mistake, 78 use the right approach whereas 79 did not.
After finding this, they suspected that the problem could be more generalised and went to analyse further papers. Out of these newly sampled 120 articles on cellular and molecular neuroscience published in Nature Neuroscience between 2009 and 2010, not a single publication used correct procedures to compare effect sizes. At least 25 papers erroneously compared significance levels either implicitly or explicitly.
8. What was the origin of this mistake?
The authors suggest that it could be due to the fact that people are generally tempted to attribute too much meaning to the difference between significant and not significant. For this reason, the use of confidence intervals may help prevent researchers from making this statistical error. Whatever the reasons behind the mistake, its ubiquity and potential effect suggest that researchers and reviewers should be more aware that the difference between significant and not significant events is not itself necessarily significant.
I see this as a great opportunity and a challenge for the statistical community, i.e., to contribute to the generation of invaluable knowledge in the applied areas that make use of their techniques.
Selected publications:
Bedia, M. & Di Paolo (2012). Unreliable gut feelings can lead to correct decisions: The somatic marker hypothesis innon-linear decision chains. FRONTIERS IN PSYCHOLOGY. 3 – 384, pp. 1 – 19 pp. 2012. ISSN 1664-1078
Aguilera, M., Bedia, M., Santos, B. and Barandiaran, X. (2013). The situated HKB model: How sensorimotor spatialcoupling can alter oscillatory brain dynamics. FRONTIERS IN COMPUTATIONAL NEUROSCIENCE. 2013. ISSN 1662-5188
De Miguel, G and Bedia, M.G. (2012). The Turing Test by Computing Interaction Coupling. HOW THE WORLD COMPUTES: TURING CENTENARY CONFERENCE AND 8TH CONFERENCE ON COMPUTABILITY IN EUROPE, CIE 2012. Cambridge, ISBN 3642308694
Santos, B., Barandiaran, X., Husband, P., Aguilera, M. and Bedia, M. (2012). Sensorimotor coordination and metastability in a situated HKB model. CONNECTION SCIENCE. 24 – 4, pp. 143 – 161. 2012. ISSN 0954-0091
In this post I would like to offer a brief and practical introduction to different types of spatial data that we could handle in R and how to do to access and visualize them. It might be interesting for those who are beginners in spatial analysis.
The structure of spatial objects is complex, and it is very important to know how these are stored and organized. sp package provides a clear form to organize the spatial data through classes (to specify the structure ) and methods (particular functions for a special data class).
The basic type of spatial objects is Spatial class, in which we could recognize two principal components, called slots: the first one is a bounding box that contains a matrix with the coordinate values. The second is a CRS class object, the reference system for these coordinates (“longlat” for geographical coordinates, “utm” for UTM coordinates, ect). With the command getClass(“Spatial”) (once sp package has been installed), we can examine the different types of spatial objects and their subclasses.
I will use an example to make this something practical, specifically we use a Spanish cartography (“ESP_adm.zip”, downloaded in this web site: http://www.gadm.org/). To use the ‘readShapeSpatial’ function to read the .shp object, we need the library ‘maptools’.
library(maptools) # load the shapefile: map <- readShapeSpatial('ESP_adm2.shp') # map is a SpatialPolygonsDataFrame, a subclass of a SpatialPolygons object.
A Spatial Polygons object is a closed line, a sequence of point coordinates where the first point is the same as the last point. It is not easy to know their structure, let us look at this briefly:
# examine the structure of this SpatialPolygonsDataFrame object. # Each slot (5) is determined by ‘@name_slot’ str(map, max.level = 2)
But, what means each slot?
# The provinces: map@data$NAME_2 # or slot(map, ‘data’)$NAME_2 # Idem for autonomous community with ‘$NAME_1’
str(map@polygons[[1]]@Polygons)
Finally, I will give just a simple example of how to plot some geographical coordinates in the ‘map’. For this, we read a data file with the locations and associate a spatial structure creating a SpatialPoints class. We need to be careful with the reference system of the SpatialPolygons and the SpatialPoints we used. The coordinate reference system for our shapefile is latitude/longitude and the WGS84 datum. Thus, we need the same CRS to our SpatialPoints. But…what if we do not have geographic coordinates? Let’s see it!
Assume that the coordinate reference system of the SpatialPoints is UTM (Universal Transverse Mercator) instead of latitude/longitude.
loc_utm <- read.csv(“loc_utm.csv”,header=T)
> loc_utm
X Y
1 718970.0 4376275.0
2 505574.2 4793683.2
3 276066.1 4542666.9
4 538135.4 4750010.9
There are four locations situated (in order) in Valencia, Bilbao, Salamanca and Santiago de Compostela. The first three points are in UTM zone 30 and the last one in zone 29. It is important to know this to define the CRS correctly:
# We separate these points according to the zone: loc1 <- loc_utm[1:3,] loc2 <- loc_utm[4,] # To create a SpatialPoints: coordinates(loc1) <- c("X","Y") coordinates(loc2) <- c("X","Y") # To define the CRS: proj4string(loc1) <- CRS("+proj=utm +zone=30 +ellps=WGS84") proj4string(loc2) <- CRS("+proj=utm +zone=29 +ellps=WGS84") # spTransform function provide transformation between projections. library(rgdal) loc1_geo<- spTransform(loc1, CRS("+proj=longlat +ellps=WGS84")) loc2_geo<- spTransform(loc2, CRS("+proj=longlat +ellps=WGS84"))
Now we can plot these points over the Spanish mapping.
plot(map) plot(loc1_geo,add=T) plot(loc2_geo,add=T)
This post has been a short summary of some of Spatial objects that we can manipulate in R. In the following posts I will invite you to continue learning about this area of statistics. I hope it helps someone!
Following previous posts on longitudinal models (see posts) and interviews where mentions have been made to survival analysis (read one of these interviews), we will focus today on models incorporating both methodologies, the so-called Joint Models.
Often problems pose more than a single approach. For example, when HIV data are analyzed, we can on the one hand focus on studying the evolution of CD4 cell counts (longitudinal model) and on the other hand, we can also study the time to death (survival analysis model). Joint modeling will allow both models to share information and will achieve better results.
From a practical point of view, in R, we can use two packages called JM and JMbayes to fit these models, under either the frequentist or the Bayesian paradigm, respectively. The use of these packages is not difficult because they are based on the use of packages nlme and survival with which you might already be familiar.
Here I show you how to use them:
library(JM) library(JMbayes)
# Fit the longitudinal model lmeFit.aids <- lme(CD4 ~ obstime * drug, random = ~obstime | patient, data = aids) # Fit the survival model survFit.aids <- coxph(Surv(Time, death) ~ drug, data = aids.id, x = TRUE)
# Fit the joint model (Frequentist) jointFit.aidsFREQ <- jointModel(lmeFit.aids, survFit.aids, timeVar = "obstime") # Fit the joint model (Bayesian) jointFit.aidsBAYES <- jointModelBayes(lmeFit.aids, survFit.aids, timeVar = "obstime")
It’s possible to specify different characteristics about the joint model. With JM package various options for the survival model are available. With JMbayes it is possible to choose the type of association structure between the longitudinal and survival processes.
Finally, if you are interested in these models and are thinking about using them in your research, a good reference is the book Joint Models for Longitudinal and Time-to-Event Data: With Applications in R of Dimitris Rizopoulos, author of JM and JMbayes.
Note: The databank to fit the longitudinal model must have as many rows as observations per patient have (aids). On the other hand, the survival model requires a database that has only one observation for each patient over time to death and censoring indicator (aids.id).
Animations can be a refreshing way to make a website more attractive, as we did exactly a year ago here.
Based on the Brownian motion simulated here, and using animation and ggplot2 R packages, we produced a fun welcome to the International year of Statistics, Statistics2013. The function saveMovie (please notice that there exists another alternative, saveGIF) allowed us to save it and finally publish it, as easy as that!
For this new year we have based our work on the demo(‘fireworks’) from the 2.0-0 version of the package animation (find here a list of the changes in each version) and world from the package maps (speaking of maps, have a look at this great dynamic map by Bob Rudis!).
They also come very handy when trying to represent visually a dynamic process, the evolution of a time-series,etc.
As an example, in a previous post we portrayed an optimisation model in which the values for the mean were asymptotically approaching the optimal solution, which was achieved after a few iterations. This was done with a for loop and again packages ggplot2 and animation.
There are many other examples of potential applications both in general Statistics – see this animated representation of the t distribution and these synchronised Markov chains and posterior distributions plots– and in Biostatistics. Some examples of the latter are this genetic drift simulation by Bogumił Kamiński and this animated plots facility in Bio7, the integrated environment for ecological modelling.
R package caTools also allows you to read and write images in gif format.
Finally, in LaTeX, \animategraphics from the animate package will do the trick; check this post by Rob J. Hyndman for further details.
It is certainly one of our new year’s resolutions to incorporate more animations in our posts, what are yours?
Anabel Forte Deltell, graduated in mathematics and statistics, holds a PhD in Statistics from the Univeristy of Valencia. Now she is a lecturer at the economics department of Universitat Jaume I at Castellón. Her research topics cover many fields of statistics as, for instance, spatial analysis or joint modeling of longitudinal (or panel) data and time-to-event data with her main interest beeing Bayesian model selection and, in particular, Objective Bayesian variable selection. Her dissertation (2011) entitled “Objective Bayes Criteria for Variable Selection” focuses on this last topic. She has published various articles in international journals and participated in many national and international meetings.
Email: forte@uji.es
1. Why do you like Biostatistics?
Researching in biostatistics is researching in real life and it makes me feel like I’m helping to improve people´s life conditions.
2. Could you give us some insight in your current field of research?
Despite my main line of research being Bayesian model selection, I’m actually working in joint modeling of longitudinal data and time-to-event data. Nowadays many practitioners are moving to what it is called personalized medicine with medical decisions, practices, and/or products being tailored to the individual patient. In this scenario longitudinal data seems to be of great importance since it allows for the consideration of several sources of uncertainty including patient-specific variability. Moreover, quantifying the risk of suffering a certain event of a patient given its trajectory seems really sensible and can be done using joint modeling.
3. Which are, in your opinion, the main advantages of being a researcher?
For me, being a researcher is simply a way of life. I can not imaging myself doing anything else.
4. Your whole professional experience has been within the public sector and the University. How do you see the present and future of research in the Spanish public sector?
From my point of view if things do not change greatly, researching in Spain, at least in the public sector, has a really black future. As I see it, the only way researchers can manage to get money is to call attention of private investors. But private money can compromise the objectivity of research and at the same time avoid researching in some fields that are not so “attractive” for industry. Hence it is something that has to be studied carefully… but something has to be done, that´s for sure.
5. What do you think of the situation of young biostatisticians in Spain?
As I see it, young biostatisticians have to make a great effort to show their enormous value for private companies as, for instance, food processing companies or health care related companies among others… And I think so because, pitifully, nowadays the Spanish public system can not assume all the biostatistical research that is needed.
6. What would be the 3 main characteristics or skills you would use to describe a good biostatistician?
For me it is really important to have a good mathematical/probabilistic base and to be well organized when working. But the most important of all is enjoying research.
7. Which do you think are the main qualities of a good mentor?
For me, a good mentor is someone that supervises your work and is always there for you but at the same time he or she should give you some space for you to fail and learn… in other words someone that teaches you what researching really is.
Selected publications:
After twelve months filled with exciting events celebrating the great role of statistics in society, it is now time to say goodbye to the International Year of Statistics, Statistics2013.
Illustration: Cinta Arribas
From FreshBiostats we would like to thank the steering committee for making of this such a fantastic opportunity for professionals and students at all levels to interact and share the pride in what we do. We would like to send a special thank you to Ronald Wasserstein and Jeffrey Myers who helped us feel involved in the celebration with our small contributions to their newsletter and blog.
The good news is that it seems like there will be a new project starting in late January. We are really looking forward to The World of Statistics! – find here a whole lot of new activities already arranged for 2014-.
It is well known that R is one of the most used statistical package by the data scientists (researchers, technicians, etc..). This package allows flexibility to adapt already developed functions to their own interests. However, the problem comes when one needs to perform a complex algorithm that requires strong programming level or even to show to worldwide scientists how a new statistical technique works. When this scenario is present, it means that it is time to create your own package in R.
The most In this post we will learn how to make a simple R package only following several steps which are explained in the next paragraphs:
1. Update R version and other relevant tools.
First of all, if you are planning to develop a R package, it is highly recommended to update your R version to newest one. One must take into account that R versions are continuously updating so that your package would be likely to not to work properly. Furthermore, the package called Rtools is necessary to install in your system. This package has several add-ins (such as MinGW compiler) which are helpful for the development of the package.
2. Get to know your Operative System
For Windows System, you have to add to the PATH system variable the whole path where R is installed. For instance, if R is installed in c:, the PATH can be fixed as:
PATH=%PATH%;c:\Rtools\bin;c:\Rtools\perl\bin;c:\Rtools\MinGW\bin;
c:\texmf\miktex\bin;c:\R-2.15.0\bin;
3. Decide the name and the kind of the package wanted to build.
This step is the most important. First of all, the package name can only be composed by numbers and letters and it is highly recommended to start it with a letter.
When building the functions belonging to the package, it is very important the arrangement of the files where the functions are written. There are two possible situations: (1) all functions in a file; (2) each function in its own file. Both options are extreme since (1) writing all functions in a file could overload the system and (2) writing many functions in its own files implies having multiple files that should be packed. So, in order to search an alternative to these options, I suggest grouping related functions into the same files.
4. Edit some files
Once the previous steps are done, several files should be filled out/edited so that any R user who wants to use the functions within the package is able to understand what the function does.
First of all, it is obligatory to fill in a file named DESCRIPTION. It is used to describe the package, with author, and license conditions in a structured text format that is readable by computers and by people.
Secondly, a tutorial of the package (called as VIGNETTES) should be written. An extended description with reproducible examples shows the performance of each function within the repository.
5. Building, installing and executing the package
As final step, you have to mix all the components in a file by means of different software commands. Using the package.skeleton() function (see below) one can create the desired mypkg package:
package.skeleton(list=c(“f”,”g”,”h”,”i”),name=”mypkg”)
where f, g, h, i¸ are all the functions belonging to the repository.
Once mypkg is created, the next step is to install it in our R system. You can use the common ways to perform it but I would advise using R commands to do that (R CMD INSTALL). Load the package and execute it. It is time to enjoy it!!
This has been a summary of the steps to follow for a R package repository development. Now it is your challenge!!!
Merry Christmas and Happy New Year!
Looking for information about meta-analysis in R (subject for an upcoming post as it has become a popular practice to analyze data from different Genome Wide Association studies) I came across this tutorial from The R User Conference 2013 – I couldn´t make it this time, even when it was held so close, maybe Los Angeles next year…
Back to the topic at hand, that is how I found out about the RISmed package which is meant to retrieve information from PubMed. It looked really interesting because, as you may imagine,this is one of the most used resources in my daily routine.
Its use is quite straightforward. First, you define the query and download data from the database (be careful about your IP being blocked from accessing NCBI in the case of large jobs!) . Then, you might use the information to look for trends on a topic of interest or extracting specific information from abstracts, getting descriptives,…
In order to try it out, I decided to get data regarding what has been published relating to Next Generation Sequencing. For doing so, I adopted the search terms proposed in the paper by Jia et al. Through the following code we can get the PubMed results for these search terms since 1980:
library(RISmed) query = "(exome OR whole OR deep OR high-throughput OR (next AND generation) OR (massively AND parallel)) AND sequencing" ngs_search <- EUtilsSummary(query, type="esearch",db = "pubmed",mindate=1980, maxdate=2013, retmax=30000) QueryCount(ngs_search) ngs_records <- EUtilsGet(ngs_search) years <- Year(ngs_records) ngs_pubs_count <- as.data.frame(table(years))
This code allow us to get published papers on this topic per year. By getting also data about the total number of publications per year, we are able to normalize the data. The complete R code, once the data are downloaded and edited can be found at FreshBiostats GitHub Gist. In the next graph, we can see the publication trend for Next Generation Sequencing per year:
I was also curious about which ones would be the journals with the highest number of publications on this topic. Using the following code we can get the count of NGS publications per journal:
journal <- MedlineTA(ngs_records) ngs_journal_count <- as.data.frame(table(journal)) ngs_journal_count_top25 <- ngs_journal_count[order(-ngs_journal_count[,2]),][1:25,]
You cand find some other examples using this package at Dave Tangs Bioinformatics blog. Additionally, some alternatives to the use of RISmed package can be found at R Chronicle and R Psychologist blogs.
Other potential applications of this package include creating a co-author network, as is described in Matthew Maenner´s blog.
Search and analyze carefully!
An exciting workshop, organised as the capstone of the International Year of Statistics Statistics2013, was celebrated on the 11^{th} and 12^{th} of November at the Royal Statistical Society Headquarters in London, under the stimulating title of “The Future of the Statistical Science”, science to which John Pullinger, president of the Royal Statistical Society, refers to as a quintessentially, multidisciplinary discipline. The workshop was indeed a good example of that.
Online attendance was welcomed to the event giving an additional opportunity to listen to highly reputed professionals in main areas of Statistics. Virtual participants were also allowed to pose their questions to the speakers, an innovation that worked very well as an excellent way to make these sorts of events available to a wider audience that would otherwise be excluded – I am already looking forward to more! -.
In case you missed it at the time, luckily the podcast is still available here.
True to its description, the event covered different areas of application, it showed the tools of the field for tackling a wide range of challenges, it portrayed potentially high impact examples of research, and in summary, was a great attention-grabbing exercise that will hopefully encourage other professionals to the area. A common vision of ubiquity and great relevance came across from the speeches and showed the field as a very attractive world to join. See my simplified summary below in form of a diagram certainly reflective of the great Statistics Views motto “Bringing Statistics Together”.
Fig 1. Some of the ideas discussed in the workshop
Particularly relevant to this blog and my particular interests were presentations on environmental Statistics and statistical Bioinformatics, as well as other health-related talks. Important lessons were taken on board from the rest of the speakers too.
The workshop started with a shared presentation on the hot topic “Statistical Bioinformatics” by Peter Bühlmann (ETH Zürich) and Martin Vingron (Free University of Berlin). In a fascinating talk, Bühlmann argued for the need of uncertainty quantification in an increasingly heterogeneous data world –examples of this are also appearing in other fields, e.g. in the study of autism disorders as Connie Kasari and Susan Murphy mentioned in “SMART Approaches to Combating Autism”- and for models to be the basis of the assignation of uncertainties – topic greatly covered by Andrew Gelman in “Living with uncertainty, yet still learning”-, while acknowledging that in the big data context, “confirmatory statistical inference is (even more) challenging”. Vingron followed by focusing on “Epigenomics as an example”, raising open questions to the audience on how to define causality in Biology where “most processes […] are feedback circular processes”, calling for models that are just complex enough so as to allow for mechanistic explanations, and for good definitions of null hypotheses.
In addition to the interesting points of the talk, I found its title particularly attractive in what it could be directly linked to a vibrant roundtable on “Genomics, Biostatistics and Bioinformatics” in the framework of the 2nd Biostatnet General Meeting, in which, as some of you might remember, the definitions of the terms Biostatistics and Bioinformatics were discussed. I wonder if the term “statistical Bioinformatics” would be indeed the solution to that dilemma? As a matter of fact, Bühlmann himself mentions at the start of his talk other options like Statistical Information Sciences, etc-
Michael Newton and David Schwartz from the University of Wisconsin-Madison also focused on the triumph of sequencing and “Millimeter-long DNA Molecules: Genomic Applications and Statistical Issues”. Breast cancer being one of the mentioned applications, this was followed by an introduction to “Statistics in Cancer Genomics” by Jason Carroll and Rory Stark (Cancer Research UK Cambridge Institute, University of Cambridge), particularly focusing on breast and prostate cancer and the process of targeting protein binding sites. The latter, as a computational biologist, mentioned challenges on the translation of Statistics for the biologists and vice versa, and identified ”reproducibility as (the) most critical statistical challenge” in the area – also on the topic of reproducibility, it is especially worth watching “Statistics, Big Data and the Reproducibility of Published Research” by Emmanuel Candes (Stanford University) -.
Another area increasingly gaining attention in parallel with technologies improvements is Neuroimaging. Thomas Nichols (University of Warwick) lead the audience in a trip through classical examples of its application (from Phineas Gage´s accident to observational studies of the structural changes in the hippocampi of taxi drivers, and the brain´s reaction to politics) up to current exciting times, with great “Opportunities in Functional Data Analysis” which Jane-Ling Wang (University of California) promoted in her talk.
Also in the area of Health Statistics, different approaches to dietary patterns modelling were shown in “The Analysis of Dietary Patterns and its Application to Dietary Surveillance and Epidemiology” by Raymond J. Carroll (Texas A&M University), and Sue Krebs-Smith (NCI), with challenges in the area being; finding dietary patterns over the life course – e.g. special waves of data would appear in periods like pregnancy and menarchy for women-, and incorporation of technology to the studies as a new form of data collection –e.g., pictures of food connected to databases-.
Again with a focus on the challenges posed by technology, three talks on environmental Statistics outlined an evolution over time of the field. Starting from current times, Marian Scott (University of Glasgow) in “Environment – Present and Future” stated in her talk that “natural variability is more and more apparent” because we are now able to visualise it. This idea going back to the heterogeneity claimed by Bühlmann. Despite the great amounts of data being available, and initially caused by technological improvements but ultimately due to a public demand – “people are wanting what it is happening now”-, the future of the field is still subject to the basic questions: “how and where to sample.” Especially thought-provoking were Scott´s words on “the importance of people in the environmental arena” and the need for effective communication: “we have to communicate: socio-economic, hard science…, it all needs to be there because it all matters, it applies to all the world…”
Amongst other aims of this environmental focus such as living sustainably – both in urban and rural environments-, climate is a major worry/fascination which is being targeted through successful collaborations of specialists in its study and statisticians. “Climate: Past to Future” by Gabi Hegerl (University of Edinburgh) and Douglas Nychka (NCAR) covered “Regional Climate – Past, Present and Future” and showed relevant examples of these successful collaborations.
Susan Waldron (University of Glasgow), under the captivating speech title of “A Birefringent Future”, highlighted the need to address challenges in the communication of “multiple layers of information” too, Statistics appearing as the solution for doing this effectively –e.g. by creating “accessible visualisation of data” and by incorporating additional environmental variation in controversial green energy studies such as whether wind turbines create a microclimate-. Mark Hansen (Columbia University Journalism School) seconded Waldron´s arguments and called for journalists to play an important role in the challenge: “data are changing systems of power in our world and journalists. […] have to be able to think both critically as well as creatively with data”, so as to “provide much needed perspective”. Examples of this role are; terms such as “computational Journalism” already being coined and data visualisation tools being put in place -as a fascinating example, the New York Times´ Cascade tool builds a very attractive representation of the information flow in Social Media-. David Spiegelhalter (Cambridge University) also dealt with the topic in the great talk “Statistics for an Informed Public”, providing relevant examples (further explanation on the red meat example can be found here as well).
To encourage “elicitation of experts opinions (to reach a) rational consensus” and “to spend more time with regulators” came up as other useful recommendations in the presentation “Statistics, Risk and Regulation” by Michel Dacorogna (SCOR) and Paul Embrechts (ETH Zürich).
In a highly connected world inmersed in a data revolution, privacy becomes another major issue. Statistical challenges arising from networked data –e.g. historical interaction information modelling- were addressed by Steve Fienberg (Carnegie Mellon University) and Cynthia Dwork (Microsoft Research), who argued that “statisticians of the future will approach data and privacy (and its loss) in a fundamentally algorithmic fashion”, in doing so explicitly answering the quote by Sweeney: “Computer science got us into this mess, can computer science get us out of it?”. Michael I. Jordan (University of California-Berkeley) in “On the Computation/Statistics Interface and “Big Data”” also referred to “bring(ing) algorithmic principles more fully into contact with statistical inference“.
I would like to make a final mention to one of the questions posed by the audience during the event. When Paul Embrechts was enquired about the differences between Econometrics and Statistics, a discussion followed on how crossfertilisation between fields has happened back and forth. As a direct consequence of this contact, fields rediscover issues from other areas. For instance, Credit Risk Analysis models were mentioned as being inferred from Medical Statistics or back in Peter Bühlmann´s talk, links were also found between Behavioural Economics and Genetics. These ideas, from my point of view, bring together the essence of Statistics, i.e. its application and interaction with multiple disciplines as the foundations of its success.
Many other fascinating topics were conveyed in the workshop but unfortunately, I cannot fit here mentions to all the talks. I am sure you will all agree it was a fantastic event.
Facundo Muñoz, MSc in Mathematics, PhD in Statistics from University of Valencia. He is currently a postdoc researcher at the french Institute National de la Recherche Agronomique (INRA). His main research field is spatial (Bayesian) statistics applied to environmental and biostatistical problems. He is currently working on statistical methodologies for the analysis of forest genetic resources.
Being a busy biostatistician, I spend plenty of time glued to my computer. As an immediate consequence, every once in a while I need to set up my working station from zero. Either when I change job (I have done twice this year!), or when I want to update my OS version (upgrades rarely go perfect), or when I get a new laptop.
This involves, you know, installing the OS, the main programs I need for work like R and LaTeX, some related software like a good Control Version System, a couple of Integrated Development Environments for coding and writing, and a dozen of other ancillary tools that I use every now and then.
Furthermore, I need to configure everything to run smoothly, set up my preferences, install plugins, and so on.
Last time I did this manually, I spent a week setting everything up, and in the following days I always had something missing. Then I thought I should have got this process scripted.
Last week I set up my working environment in my new job. In a few hours I had everything up and running exactly the way I like. I spent an aditional day updating the script with new software, updated versions, and solving some pending issues.
I thought this script might be useful for others as well, hence this post. It is version-controled in a google code repository, where you can download the main script.
It is not very general, as installation details changes a lot from system to system. I use Linux Mint, but I believe it should go pretty straightforward with any derivative of Ubuntu, or Ubuntu itself (those distros using the APT package management). Other Linux branches (Arch, RedHat, Suse, Mac’s Darwin) users would need to make significant changes to the script, but still the outline might help. If you use Windows, well… don’t.
Of course, you will not be using the same software as I do, nor the same preferences or configurations. But it might serve as a guide to follow line by line, changing things to suit your needs.
In particular, it provides an almost-full installation (without unnecessary language packages) of the very latest LaTeX version (unlike that in the repos), and takes care of installing it correctly. It also sets up the CRAN repository and installs the latest version of R.
The script also installs the right GDAL and Proj4 libraries, which are important in case you work with maps in R or a GIS.
Finally, it installs some downloadable software like Kompozer (for web authoring), the Dropbox application, and more. It scrapes the web in order to fetch the latest and right versions of each program.
I hope it helps someone. And if you have alternative or complementary strategies, please share!
So here we are. It has been a year since we started this venture. The idea of a blog came up from one of our co-bloggers at the Jede II Conference in the summer of 2012. At first it sounded like a bit of a challenge, but who said fear?
No doubt about it, the balance has been highly positive. We are all for sharing knowledge and resources that might be valuable for others, and from our humble perspective we sincerely hope it might have been of some use. It has certainly been so for us, both by getting insight into particular subjects when writing the different posts and by diving into new topics covered by our co-bloggers and invited posts. Twitter and facebook have also allowed us to encourage interaction with colleagues and other bloggers, and we can now say our social and professional networks have certainly become bigger and stronger!
We have found it difficult at times to juggle our jobs and PhDs with writing our weekly posts but as we said in several occasions, we are passionate about our work, and firmly believe that, most of the time, the line between work and fun gets blurry.
As we promised in a previous post, here is an infographic summarising this year of Fun & Biostatistics, enjoy!
We have a very international audience with visits coming from 117 countries, and we are delighted to see that not only colleagues from our closest network are reading our entries. Since our participation on the International Statistics Year blog -Statistics2013- and after being mentioned in other blogs such as RBloggers and others, we have gained more visibility and some posts have become very popular (more than 1000 views for some!).
Those posts focusing on R tips clearly take the cake, being the most visited. We guess they might be the most useful ones, as we are also big fans of other very practical blogs. However we like to cover all the aspects of our profession and even sometimes deal with more controversial or philosophical subjects…
We will keep inviting people to share their knowledge and will encourage colleagues to get involved in the blog. Our second year resolution is to make an effort to make of this blog a more interactive tool. We count on you for that!
Remember you can contact us with your comments, suggestions, and enquiries at freshbiostats@gmail.com
Thank you so much for being there!
Jorge is a BSc Pharmacy from the UPV-EHU and he is currently a resident trainee in Microbiology and Parasitology at the H.C.U. Lozano Blesa in Zaragoza, working also on his PhD thesis.
Email: Jarribasg(at)salud(dot)aragon(dot)es
1. Could you give us some insight in your current field of research?
My PhD thesis focuses on the diagnosis of Hepatitis C Virus (HCV) infection by means of core antigen determination, and I am also collaborating in a line of research focusing on new treatments for H. pylori infection, funded by a FIS grant. The former, intends to perform a comparison with respect to the current technique in use for the diagnosis. The latter, analyses resistance to different Flavodoxin inhibitors.
2. Where does Biostatistics fit in your daily work?
In both areas of research that I am working on, since they are essentially comparative studies against established techniques, and therefore require techniques to prove the significance – or lack of significance- of the improvements.
3. Which are the techniques that you or researchers in your area use more often?
Statistical techniques such as sensitivity and specificity analysis, hypothesis testing (ANOVA, t-test). There is also a particular need in the area for techniques dealing with ordinal data.
4. As a whole, do you find Biostatistics relevant for your profession?
A very important part of the speciality of Microbiology and Parasitology focuses on the research of new diagnostic methods, treatments, prevalence of antibiotic resistance, etc. Therefore, Biostatistics becomes extremely useful when comparing these novel approaches to previous ones.
5. Finally, is there any topic you would like to see covered in the blog?
It would be great to see published some examples of statistical applications in my area of study.
Selected publications:
With the occassion of the 1st birthday of FreshBiostats, we want to remember some of the great statisticians born in September and that have contributed to the “joy of (bio)stats”.
Gerolamo Cardano | Pavia, 24 September 1501 – 21 September 1576 | First systematic treatment of probability | |
Caspar Neumann | Breslau, 14 September 1648 – 27 January 1715 | First mortality rates table | |
Johann Peter Süssmilch | Zehlendorf, 3 September 1707 – 22 March 1767 | Demographic data and socio-economic analysis | |
Georges Louis Leclerc (Buffon) | Montbard, 7 September 1707 – Paris, 16 April 1788 | Premier example in “geometric probability” and a body of experimental and theoretical work in demography | |
Adrien-Marie Legendre | Paris, 18 September 1752 – 10 January 1833 | Development of the least squares method | |
William Playfair | Liff, 22 September 1759 – London, 11 February 1823 | Considered the founder of graphical methods of statistics (line graph, bar chart, pie chart, and circle graph) | |
William StanleyJevons | Liverpool, 1 September 1835 – Hastings,13 August 1882 | Statistical atlas – graphical representations of time series | |
Anders Nicolai Kiaer | Drammen, 15 September 1838 – Oslo, 16 April 1919 | Representative sample | |
Charles Edward Spearman | London, 10 September 1863 – 17 September 1945 | Pioneer of factor analysis and Spearman´s Rank correlation coefficient | |
Anderson Gray McKendrick | Edinburgh, September 8, 1876 – May 30, 1943 | Several discoveries in stochastic processes and collaborator in the path-breaking work on the deterministic model for the general epidemic | |
Maurice Fréchet | Maligny, 2 September 1878 – Paris, 4 June 1973 | Contributions in econometrics and spatial statistics | |
Paul Lévy | 15 September 1886 – 15 December 1971 | Several contributions to probability theory | |
Frank Wilcoxon | County Cork, 2 September 1892 – Tallahassee, 18 November 1965 | Wilcoxon rank-sum tests, Wilcoxon signed-rank test | |
Mikhailo Pylypovych Kravchuk | Chovnytsia, 27 September 1892- Magadan, 9 March 1942 | Krawtchouk polynomials, a system of polynomials orthonormal with respect to the binomial distribution | |
Harald Cramér | Stockholm, 25 September 1893 – 5 October 1985 | Important statistical contributions to the distribution of primes and twin primes | |
Hilda Geiringer | Vienna, 28 September 1893 – California, 22 March 1973 | One of the pioneers of disciplines such as molecular genetics, genomics, bioinformatics,… | |
Harold Hotelling | Fulda, 29 September 1895 – Chapel Hill, 26 December 1973 | Hotelling´s T-squared distribution and canonical correlation | |
David van Dantzig | Rotterdam, 23 September 1900 -Amsterdam, 22 July 1959 | Focus on probability, emphasizing the applicability to hypothesis testing | |
Maurice Kendall |
Kettering, 6 September 1907 – London, 29 March 1983 | Random number generation and Kendall´s tau | |
Pao-Lu Hsu | Peking, 1 September 1910 – Peking, 18 December 1970 | Founder of the newly formed discipline of statistics and probability in China |
It is certainly difficult to think of the field without their contributions. They are all a great inspiration to keep on learning and working!!
Note: you can find other interesting dates here.
Update: and Significance´s timeline of statistics here.
Any author you consider particularly relevant? Any other suggestions?
Although the history of Infographics according to Wikipedia does not seem to mention Fritz Kahn as one of the pioneers of this technique, I would like to start this post mentioning one of the didactic graphical representations of this Jewish German doctor, who was highly reputed as a popular science writer of his time.
Apart from his fascinating views of the human body in the form of machines and industrial processes, I am particularly attracted by his illustration below, summarising the evolution of life in the Earth as a clock in which the history of humans would not take more than a few seconds…
Image extracted from the printed version of the article “Fritz Kahn, un genio olvidado” published in El País, on Sunday 1st of September 2013.
What could be understood by some as a naive simplification of matters requires, in my opinion, a great deal of scientific knowledge and it is a fantastic effort to communicate the science behind very complex mechanisms.
This and more modern infographic forms of visualisation represent an opportunity for statisticians –and more specifically biostatisticians-, to make our field approachable and understandable for the wider public. Areas such as Public Health (see here), Cancer research (find examples here and here), and Drug development (see here) are already using them, so we should not be ashamed to make of these “less rigorous” graphical representations an important tool in our work.
Note: There are plenty of resources available online to design a nice infographic in R. For a quick peek into how to create easy pictograms, check out this entry in Robert Grant´s stats blog. Also, the wordcloud R package will help you visualising main ideas from texts…
We will soon show a practical example of these representations in this blog, keep tuned!
The last post that I published was about two techniques of Multivariate Analysis: Principal Component Analysis (PCA) and Correspondence Analysis (CA). In this post I will show a practical example of PCA with R. Let’s go!
We are going to work with Fisher’s Iris Data available in package “datasets”. This data, collected over several years by Edgar Anderson was used to show that these measurements could be used to differentiate between species of irises. That data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.
You can load the Iris data and examine this data frame with:
data(iris) str(iris); summary(iris[1:4]) pairs(iris[1:4],main="Iris Data", pch=19, col=as.numeric(iris$Species)+1) mtext("Type of iris species: red-> setosa; green-> versicolor; blue-> virginica", 1, line=3.7,cex=.8)
We will use “prcomp” R function to carry out the analysis, which is similar to “princomp” function.
As we said in the last post, PCA is used to create linear combinations of the original data that capture as much information in the original data as possible. For that and before starting with PCA is convenient to mention some particularities of this methodology.
In the prcomp function we need indicate if the principal components are calculated through correlation matrix (with standardized data) or covariance matrix (with raw data). We will standardize our variables when these have different units and have very different variances. If they are in the same units both alternatives are possible. In our example all variables are measured in centimetres but we will use the correlation matrix for simplicity’s sake.
#To examine variability of all numeric variables sapply(iris[1:4],var) range(sapply(iris[1:4],var)) # maybe this range of variability is big in this context. #Thus, we will use the correlation matrix #For this, we must standardize our variables with scale() function: iris.stand <- as.data.frame(scale(iris[,1:4])) sapply(iris.stand,sd) #now, standard deviations are 1
Now applied the prcomp() function to calculate the principal components:
#If we use prcomp() function, we indicate 'scale=TRUE' to use correlation matrix pca <- prcomp(iris.stand,scale=T) #it is just the same that: prcomp(iris[,1:4],scale=T) and prcomp(iris.stand) #similar with princomp(): princomp(iris.stand, cor=T) pca summary(pca) #This gives us the standard deviation of each component, and the proportion of variance explained by each component. #The standard deviation is stored in (see 'str(pca)'): pca$sdev
In order to decide how many principal components should be retained, it is common to summarise the results of a principal components analysis by making a scree plot, which we can do in R using the “screeplot()” function:
#plot of variance of each PCA. #It will be useful to decide how many principal components should be retained. screeplot(pca, type="lines",col=3)
From this plot and from the values of the ‘Cumulative Proportion of Variance’ (in summary(pca)) we can conclude that retaining 2 components would give us enough information, as we can see that the first two principal components account for over 95% of the variation in the original data.
#The loadings for the principal components are stored in: pca$rotation # with princomp(): pca$loadings
This means that the first two principal component is a linear combination of the variables:
where are the standardization of original variables.
The weights of the PC1 are similar except the associate to Sepal.Width variable that is negative. This component discriminate on one side the Sepal.Width and on the other side the rest of variables (see biplot). This one principal component accounts for over 72% of the variability in the data.
All weights on the second principal component are negative. Thus the PC2 might seem considered as an overall size measurement. When the iris has larger sepal and petal values than average, the PC2 will be smaller than average. This component explain the 23% of the variability.
The following figure show the first two components and the observations on the same diagram, which helps to interpret the factorial axes while looking at observations location.
#biplot of first two principal components biplot(pca,cex=0.8) abline(h = 0, v = 0, lty = 2, col = 8)
To interpret better the PCA results (qualitatively) would be useful to have the opinion of an expert in this area, as sometimes is somewhat confusing. I encourage you to participate!
Much of the work we do is based on statistical models used to reproduce real-life processes, yet it is not the only option. When we know the rules and restrictions that dominate a system, we can reproduce their behaviour using simulation techniques and with a few simple statistical calculations, we can get to know everything that we are interested in the system.
Coinciding with the start of the Spanish League 2013-2014 two weekends ago, myself and a group of friends were thinking about a series of questions about the future results of the League:
Will we have a League of two (Barcelona and Madrid)?
What are the chances of my team to win the League (I am a supporter of Valencia)?
Assuming that Madrid or Barcelona will win League, what options have the other teams to obtain the second position?
To answer these questions, I was challenged to simulate the current League using information from the last 5 seasons:
I downloaded the general statistics of the last five seasons from www.as.com website. With this information, I calculated the odds of home win (HW), home draw (HD), home loss (HL), away win (AW), away draw (AD) and away loss (AL) for each of the 20 current teams in Spanish League. (Note: Elche probabilities are the average of probabilities from the 10 teams that have been on the First Division in the last five leagues but are not on that division in the 2013-2014 league.)
From the League schedule I have calculated the probability of win, draw or loss for each league matches, being given for each match the names of the local and the visitor teams.
Pr(Barcelona 1st and Madrid 2nd or Madrid 1st and Barcelona 2nd) = Pr(Barcelona 1st and Madrid 2nd) + Pr(Madrid 1st and Barcelona 2nd) = 0.2930 + 0.2428 = 0.5358
Figure 1: Boxplot of the Points per Team in the 10000 Leagues simulated.
Besides the obvious conclusions we can draw from these results, we can see that we are clearly in a league of two. This sort of procedure will also allow us to emulate complex systems in which we know the rules for calculating the corresponding probabilities. For example, in Biostatistics we could work out the probabilities of an offspring being affected by a genetic disease if we know the probabilities of the parents being carriers (genetic problem).
If you are interested in the topic Simulation and Sports, I recommend reading the paper “On the use of simulation methods to compute probabilities: application to the Spanish first division soccer league” of Ignacio Díaz-Emparanza and Vicente Núñez-Antón, which explains in much more detail how to approach this problem from different points of views.
I have implemented all the calculations for this post with free software R.
Do you have experience in simulation? Tell us about it!!!
Online free courses/tutorials: there is plenty of material on line, which makes it sometimes difficult to filter what is really worthy. Here again, tips from blogs or colleagues from your network might serve as reference. Coursera is, in my opinion, one of the best platforms, due to the quality and versatility of its courses. There are several excellent courses related to Statistics and Data Analysis. Some of them are more general about R programming (e.g. Data Analysis,Computing for Data Analysis – both using R- ),but there are also more specific ones (e.g. Design and Interpretation of Clinical Trials, Statistical Analysis of fMRI Data,.. you can check the full list here.
I would like to mention here some other resources available for those with a math/statistics background who might be interested in getting some insight into genetics. As we mentioned previously in other posts, it is critical to understand the data you are dealing with and these websites will help you with that:
An extensive list of additional Online Genetics Education Resources can be found at the NHGRI site
For those wanting to get an introduction to NGS, there is a Next Generation Sequencing Practical Course at EMB-EBI Train online. A more advanced tutorial, showing the use of R/Bioconductor packages for High-Throuput Sequence Analysis can be found here.
There are, obviously, countless courses and tutorials about R and specific packages. Besides, GitHub is becoming more and more popular.By creating Gist on GitHub you can share your code quickly and easily, see a quick example here.
Webinars: many commercial sites offer highly focused free Webinars that might be of interest. For instance both Science and Nature host webcasts regularly.
Forums /discussion list: when you are stuck with something and you are not able to find a solution, specialized forums might come to the rescue. Either because your same question has been asked before, or because there is someone willing to help, you will most likely get your doubt solved. Two forums are particularly useful in my field, BioStar and SEQanswers. Talking about R programming, R-help from R Mailing List and Stack Overflow are two of the sites where you can found most of your doubts solved. Our life without them would be much more difficult for sure…
As I mentioned at the beginning of the previous post, it is sometimes difficult to find a balance between the time you spend learning and your more “productive” time. Besides for those of us whose work is also a passion, the line between work and personal interests becomes blurred quite often. And so we will spend much of our leisure time diving around new stuff that eventually will be useful in our work. Some might argue that the time spent in training or the amount of information you have access to might be overwhelming. Is it worth the effort? How much time should we invest in learning? Are we able to take advantage of what we learn? You can take a look at this video for more elaborate thoughts on the subject.
I hope the information contained in these posts might be useful… Your suggestions on additional resources will be more than welcome!
No doubt about it, we must keep up with news and advances in our area of expertise. In this series of two posts I just want to introduce the ways I find useful in order to achieve this goal. Staying up-to-date means not only knowing what is being done in your field but also learning new skills, tools or tricks that might be applied. I will save for last some thoughts about getting a proper work-learning balance and potential impact on productivity.
To be continued next week …..
In previous posts on this site, comparisons among the most used statistical packages available for scientists were conducted, pointing out their strengths and weaknesses.
Although nowadays there is a trend to use open source programs for data science development, there are some commercial statistical packages which are also important since they make our scientist life easier. One of them is called Statistical Analysis System® (SAS).
SAS System was founded in 1970s and since then it is leading product in data warehousing, business analysis and analytical intelligence. It is actually the best selling all-in-one database software. In other words, SAS can be described as an Exporting-Transformation-Loading (ETL), Reporting and Forecasting tool. This makes SAS a good option for Data warehousing. It could be also considered as a statistical software package that allows the user to manipulate and analyze data in many different ways.
The main basic component of the SAS program is the Base SAS module which is the part of the software designed for data access, transformation, and reporting. It contains: (1) data management facility – extraction, transformation and loading of data – ; (2) programming language; (3) statistical analysis and (4) output delivery system utilities for reporting usage. All these functions are managed by means of data and call procedures. In the following sections some introductory and basic examples will be described:
(a) Generating new data sets.
It is possible to generate new data sets by using the SAS Editor environment (place where the code is written and executed). Suppose we want to create a data set of 7 observations and 3 variables (two numerical and one categorical).
data one;
input gender $ age weight;
cards;
F 13 43
M 16 62
F 19 140
F 20 120
M 15 60
M 18 95
F 22 75
;
run;
The SAS code shown above will create the desired data set called “one”. As gender variable is categorical – the first variable-, its values are “M” for MALE and “F” for FEMALE. The dollar sign ($) is used when you have a text variable rather than a numerical variable (i.e., gender coded as M, F rather than as 1 denoting male and 2 denoting female).
(b) Statistical analysis
For the performance of a basic descriptive analysis, – percentages and mean computations – the next procedures should be executed:
· For frequencies and percentage computation.
proc freq data = one;
tables gender;
run;
· Descriptive analysis for continuous variables.
proc means data = one n mean std;
var age weight;
run;
As it can be observed, it is really easy to remember which statistical procedure to use according to the type of variable: for categorical data proc freq procedure and proc means for continuous data.
(c) Output
Another important issue is the way the results are shown. The SAS statistical package has improved this section. Before the release of the version 9.0 – I think so!- , one had to copy and paste all the results from the output window to a WORD document in order to get a proper and saved version of the results. From version 9.0, things have changed: All results can be delivered to a PDF, RTF or even HTML format. As SAS user, I can say that this has been a good idea, since not only I no longer have to waste lots of time doing copy-paste, but also the not so useful results can be left unprinted. That has been a great leap!!! For example, if you want to have the results in PDF format, you should follow the next instructions:
ods pdf;
proc means data = one n mean std;
var age weight;
run;
ods pdf close;
This code generates a PDF file where shows the mean and standard deviation of the age and weight variable of the one data set.
Because of its capabilities, this software package is used in many disciplines, including the medical sciences, biological sciences, and social sciences. Knowing the SAS programming language will likely help you both in your current class or research, and in getting a job. If you want to go further in SAS programming knowledge, The Little SAS Book by Lora Delwiche and Susan Slaughter is an excellent resource for getting started using SAS. I also used it when I started learning SAS. Now it is your turn!!
Have a nice holiday!
The topic of Mixed Models is an old-friend of this blog, but I want to focus today on the R code for these models.
Amongst all the packages that deal with linear mixed models in R (see lmm, ASReml, MCMCglmm, glmmADMB,…), lme4 by Bates, Maechler and Bolker, and nlme by Pinheiro and Bates are probably the most commonly used -in the frequentist arena-, with their respective main functions lmer and lme.
I am still unsure as to which one I would choose – if I had to -, but I am sharing here a summary of some of their capabilities, in case it can be of help:
Model specification
I will be using for all of the following examples the balanced dataset Orthodont from the package nlme, which reflects the change in an orthodontic measurement over time for a sample of 27 children (see Fig. 1).
Fig. 1. Spaghetti plots. Distance vs age by gender – dataset Orthodont
For simplicity´s sake, I will consider the following initial models with a simple random factor (please see ref. [3] for centering and further analysis):
model_lmer<-lmer(distance~age+Sex+(1|Subject),data=Orthodont) model_lme<-lme(distance~age+Sex, random=~1|Subject,data=Orthodont)
Tests
The results for t-tests and F-tests based on Restricted Maximum Lilkelihood (REML) can be found by using the following lines of code (you can add REML=FALSE to change this default setting):
summary(model_lmer)
…
Fixed effects:
Estimate Std. Error t value
(Intercept) 17.70671 0.83391 21.233
age 0.66019 0.06161 10.716
SexFemale -2.32102 0.76139 -3.048
…
(Please notice that the reference category in Sex can be changed by using relevel).
anova(model_lmer)
Analysis of Variance Table
Df Sum Sq Mean Sq F value
age 1 235.356 235.356 114.8383
Sex 1 19.034 19.034 9.2875
Conditional t-tests and F-tests are used for assessing the significance of terms in the fixed effects in lme.
Both F and t conditional tests results are based on REML conditional estimate of the variance. This option will be the default, but we can specify method=”ML” for Maximum Likelihood estimates.
For the results from the conditional t-test testing the marginal significance of each fixed effect coefficient with the other fixed effects in the model, we can use:
summary(model_lme)
…
Fixed effects: distance ~ age + Sex
Value Std.Error DF t-value p-value
(Intercept) 17.706713 0.8339225 80 21.233044 0.0000
age 0.660185 0.0616059 80 10.716263 0.0000
SexFemale -2.321023 0.7614168 25 -3.048294 0.0054
…
For the results from the conditional F-test testing the significance of terms in the fixed effects model (sequentially by default), we can use:
anova(model_lme)
numDF denDF F-value p-value
(Intercept) 1 80 4123.156 <.0001
age 1 80 114.838 <.0001
Sex 1 25 9.292 0.0054
(The argument type would also allow us to specify marginal F-tests).
These conditional tests for fixed-effects terms require denominator degrees of freedom, which will be the focus of the next section.
Degrees of freedom
The denominator degrees of freedom correspond to “the classical decomposition of degrees of freedom in balanced, multilevel ANOVA designs” [3]. It is worth noticing that the values for these degrees of freedom do not match those provided by other software procedures such as SAS PROC MIXED (see discussions on the topic here and here).
Additionally to the denominator degrees of freedom aforementioned, conditional F-tests also require numerator degrees of freedom defined by the term (see output from previous section).
p-values
Random effects
Random effects plots
Two different approaches to the plotting of the random effects can be obtained through the following lines of code:
plot(ranef(model_lme))
qqmath(ranef(model_lmer))
Fig. 2. Random effects plots for model_lme and model_lmer.
Residuals plots
res_lme=residuals(model_lme) plot(res_lme) qqnorm(res_lme) qqline(res_lme) plot(model_lme)
Fig. 3. Residual plots for model_lme.
Correlation structure
model_lme<-lme(distance ~ age + factor(Sex),random = ~ 1 | Subject, cor=corAR1(0.6,form=~1|Subject),data = Orthodont)
…
Correlation Structure: AR(1)
Formula: ~1 | Subject
Parameter estimate(s):
Phi
0.05849311
…
Further structures available can be found in the help for corClasses.
Heteroscedasticity
Although some other differences could be mentioned, future posts will cover those other matters.
The following books have been found extremely useful:
[1] Badiella, L., Sánchez, J.A. “Modelos mixtos con R versión 0.3” ( from the course “Modelos Mixtos utilizando R”)
[2] Bates, D. (2010). lme4: Mixed-effects modelling with R. Springer
[3] Pinheiro, J.C. and Bates, D.M. (2000). Mixed-Effects Models in S and S-Plus. Springer
[4] Zuur, A.F. et al. (2009). Mixed effects models and extensions in ecology with R. Springer
I will be more than happy to read your comments and tips! 🙂
Recent improvements in the technology to record data, have coincided with calls for making this data freely available. Health related studies are a particular case in point.
At the forefront of these changes, reputable publications have taken measures to set transparency standards. Since January for example, the British Medical Journal “will no longer publish research papers on drug trials unless all the clinical data on what happened to the patients is made available to anyone who wants to see it.” (Significance magazine Volume 9 Issue 6, December 2012)
In a sector that has often been accused of secrecy, GlaxoSmithKline are also engaged in this spirit of openness. They recently announced that they would make available “anonymised patient-level data from their clinical trials” to researchers “with a reasonable scientific question, a protocol, and a commitment from the researchers to publish their results” (ibid).
Fig 1. Death rates per 1000 in Virginia (USA) in 1940, VADeaths R dataset (for illustration only)
However, in the past few weeks two stories seem to challenge this trend towards greater transparency. At the same time as rumours grow in the UK of cuts in the publication of well-being data (The Guardian, 10^{th} of July 2013), controversy has arisen regarding the recently released National Health System (NHS) vascular surgeons individual performance records (BBC News 28^{th} of June 2013) .
While the measure has been welcomed by some sectors of the general public, there have been important criticisms coming from the medical side. Several doctors within the speciality, with perfectly satisfactory records, are refusing to agree to the metric. The argument is that different types and number of procedures coupled with the variability of prognoses make published indicators such as death rates misleading to the patients.
In general, calls have been made for further research into performance indicators that ensure information provided to the end-users is efficacious. As an example of this, back in 2011 when the first attempts to publicise this kind of information started, Significance magazine (Volume 8 Issue 3, September 2011) reported as one of the causes for the lack of success, failure to agree on “which indicators to use”, and also mentioned “discussions with the Royal College of General Practitioners to establish the most meaningful set.”
Tensions between opening up areas of genuine interest to the widest audience and ensuring that there are not unintended side effects, are a societal challenge in which statisticians can play a vital role: sometimes numbers cannot speak for themselves, and appropriate interpretations might be required to avoid wrong conclusions. This becomes particularly important when dealing with health issues…
Note: in Spain, it would seem that there is still much work to be done in terms of open data…. A PricewaterhouseCoopers report (pp. 120 -131) highlights the issue as one of the ten hot topics in the Spanish Health System for 2013, and welcomes the creation of the website www.datos.gob.es as one of the first steps towards openness in this and other sectors.
What are your thoughts on this issue? Are there any similar measures being taken in your country or organisation?
Laetitia is a graduate in Applied Mathematics and she also has a Master’s degree in Applied Statistics and Modelling from the University of Porto, Portugal. At present, she is a PhD student on Applied Mathematics and she works in the Research Unit UNIFAI (Institute of Biomedical Sciences Abel Salazar, University of Porto, Portugal)
Email: laetitiateixeir@gmail.com
1. Why do you like Biostatistics?
Biostatistics allows for the application of statistical theory to various areas of research practice and to work in various areas. Statistics and medicine are two areas of great interest for me and biostatistics allows me to work in both.
2. Could you give us some insight in your current field of research?
My PhD work focuses on survival analysis in the presence of competing risks. All the practical work is based on end-stage renal disease patients with peritoneal dialysis as renal function replacement therapy. We explore several statistical approaches, such as regression models taking competing risks into account, multistate models and joint models for longitudinal and competing risks data. Using these approaches, we can give more and better information about the disease progression, helping clinicians in the evaluation of patients and treatment planning.
Combined with my PhD, I am a fellow researcher at UNIFAI/ICBAS-UP, a Research Unit specialized in ageing and health.
3. Which are, in your opinion, the main advantages of being a researcher?
The opportunity to work in several areas with multidisciplinary teams.
4. What do you think of the situation of young biostatisticians in Portugal?
In Portugal, biostatisticians are mostly present in higher education institutions. Some public and private enterprises have been integrating some young biostatisticians, however in a very limited number. Some colleagues have gone to other European countries, where they have found better opportunities in this area.
5. What would be the 3 main characteristics or skills you would use to describe a good biostatistician?
Interested in research, versatile and good communicator.
6. Which do you think are the main qualities of a good mentor?
Motivator, interested in research and dedicated, good communicator.
7. Finally, is there any topic you would like to see covered in the blog?
A list of working groups organized by research themes. This list would be important for young biostatisticians to find people according to working area and would allow students/researchers to create other networks.
Selected publications:
When faced with a statistical analysis task, we usually begin with a simple description about our data that will allow us to analyze and interpret our variables, with the aim of making a decision on some hypotheses that had been made at the start of the study.
This post deals with some statistical methods that are regarded as exploration techniques, but that go further than simple and usual descriptive. This talk will focus on correspondence analysis (CA) and principal components analysis (PCA), both are central to Multivariate Analysis.
PCA and CA are usually applied in high dimensional datasets with the principal objective being to reduce the dimensions of the data. Although these methods have their own particularities, both conclude to explain latent variables in the problem through observed data.
PCA: is widely used to capture essential patterns of big datasets. In high dimensional data sometimes it is difficult for researchers to extract interesting features, so one way to solve it is to reduce its dimensionality at the expense of losing information. It works by creating new uncorrelated variables (named as PCA) through linear combinations of original variables (in general correlated).
These PCA collect all information than original variables, and the goal is to select some PCA by preserving as much data variance as possible.
CA: unlike PCA, this methodology is applied in categorical data (without calculate linear combinations) as a procedure to analyze contingency tables. CA allows us to describe the relation between two nominal variables as well as the relation between the levels of themselves in a Cartesian axis.
The extension of correspondence analysis to many categorical variables is called multiple correspondence analysis.
The applications of PCA and CA are wide and varied, in fields such as biology, ecology, social sciences, psychology, image processing … in which the number of variables is big. As we have said before, in that situation the PCA and CA provide us a method to extract latent variables and population intrinsic characteristics that have not been observed, so that we can think in these as a hypothesis generation system.
With ‘ca’ (‘mjca’) and ‘princomp’ packages of R we can apply Simple (Multiple) Correspondence Analysis and Principal Components Analysis into our data. The following figure illustrates a typical graphic of PCA, representing the first two components.
Here I have briefly commented a little aspects of two procedures to describe large datasets. In following posts I will try to do an example using R. Meanwhile, you can try and play with ‘ca’ and ‘princom’ functions.
During the time that this blog has been running several posts about longitudinal data has been published. However, we have not talked yet about how we can deal with them with R.
As we have discussed in other posts, longitudinal data gather information from a group of study subjects over time (repeated measures). When the number of measurements is the same for all subjects and these measurements are equidistant along time, it is said that we have a balanced data. Otherwise our data are called unbalanced.
When working with either format in R, the joineR package that allows us to adapt our data. If our data are balanced we can move from one format to another according to the analysis we are interest in, simply by:
# generate a (balanced) data simul <- data.frame(id = 1:3, Y.1 = rnorm(3), Y.2 = rnorm(3), age = runif(3,0,18)) # move it to an unbalanced format simul.unbal <- to.unbalanced(simul, id.col = 1, times = c(1,2), Y.col = 2:3, other.col = 4) # return data to a balanced format simul.bal <- to.balanced(simul.unbal, id.col = "id", time.col = "time", Y.col = c("Y.1"), other.col = 4)
Once we have our data in an appropriate format, one of the first descriptive analysis to do is to get the empirical longitudinal variogram. The variogram allows us to check if within-subjects observations are related. In order to do that, we need to have the unbalanced data format and we will get it very easily by using the joineR variogram (although expect it to be a bit slow if you have a large dataset).
As an example, we will load a data set included in the Applied Longitudinal Data Analysis book by Judith D. Singer and John B. Willett and calculate the empirical variogram:
# read in data set (tolerance data from ALDA book) tolerance <-read.csv("http://www.ats.ucla.edu/stat/r/examples/alda/data/tolerance1_pp.txt") vgm <- variogram(indv=tolerance$id, time=tolerance$time, Y=tolerance$tolerance) plot(vgm)
The package also allows us to make some more plotting functions and analysis of longitudinal and survival data together using random effects joint models. Certainly a very interesting package for those who deal with this type of data or are interested in start working with them.
Try it and tell us your experience.
Collaboration between statisticians and researchers
In the last few days I was thinking about how researchers could collaborate efficiently with their experts in statistics. With the increasing complexity in science, interchanging information can be crucial to get the best results. But what happens when a researcher doesn’t give all the information a statistician needs?
When someone asks for help in this field -as a clinician, I need it often-, many biostatisticians ensure that some basic points are met in a research study -prior to the actual analysis-, and I think they should ask for them as soon as possible:
Have they -researchers, clinicians, whoever-, met all three tips above? Then, your chances of helping them to answer the question will increase, and even the time to the answer can decrease substantially.
I hope this has been useful!
You can find my blog “Tales of R…messing around with free code” here.
I have to admit I am not completely comfortable with the RStudio IDE yet. However, RStudio projects are of great interest, and the new package Shiny – released last November- is no exception.
Shiny allows you to build web applications in R, so anyone might be able to access your analysis results through an interactive web browser interface. As they state in their website “…your users choose input parameters using friendly controls like sliders, drop-downs, and text fields. Easily incorporate any number of outputs like plots, tables, and summaries..”
The application is structured in two components: a user interface script named ui.R and a server script – server.R. The first one deals with the input and output format, the second one contains the code to run the analysis.
I made a quick trial following the tutorial and my (very simple) web app was ready in no time. It is based on some of the examples of the SNPassoc vignette, an R package designed to perform genetic association studies. In this application, you can check the summary for a small set of SNPs and get both a plot showing the frequencies and the results of the association study for a given SNP.
By using Shiny, you can run your applications locally or share them with other users so they can run the applications in their own computers. There are several options to distribute your apps, you can check all of them here. In this case, the app can be accessed through the GitHub Git repository. Once the Shiny package is installed (along with the SNPassoc package used in this example) you just have to run the following code:
library(shiny) shiny:: runGist('4e5e618431a59abe692b')
In my opinion this tool has great potential for sharing analysis results through an interactive and friendly interface. It might replace files containing a lot of graphics and tables while saving time both to the data analyst and the end user. Have you tried it? Do you want to share your apps with us?
Guillermo Vinué Visús completed his degree in Mathematics in 2008, granted by the Universitat de València (Spain). He also holds a Master’s degree in Biostatistics from the same university. After working at the Drug Research Center of the Santa Creu i Sant Pau Hospital in Barcelona (Spain) for a year, he is currently a PhD student in the Department of Statistics and Operations Research at the Universitat de València. His doctoral thesis focuses on Statistics applied to Anthropometry.
1. Why do you like Biostatistics?
I like Biostatistics because it allows me to apply Maths to different real life problems.
2. Could you give us some insight in your current field of research?
I am working on a research project aiming to develop statistical methodologies to deal with anthropometric data, in order to tackle some statistical problems related to Anthropometry and Ergonomics.
3. Which are, in your opinion, the main advantages of being a researcher?
The main advantage is the possibility to learn everyday a bit more. Research is a continuous learning process.
4. Your whole professional experience has been within the public sector and the University. How do you see the present and future of research in the Spanish public sector?
The current situation of economic difficulties has caused that unfortunately the government budget for scientific research is more and more limited, so I am concerned about both present and future of the public Spanish research.
5. What do you think of the situation of young biostatisticians in Spain?
Neither better nor worse than other young people in the country. Nowadays, I guess the best way to make progress is to move abroad.
6. What would be the 3 main characteristics or skills you would use to describe a good biostatistician?
Enthusiasm, effort and a little bit of R knowledge.
7. Which do you think are the main qualities of a good mentor?
To be attentive and available when needed.
Selected publications:
Ipek Guler is an intern at Biostatech, Advice, Training and Innovation in Biostatistics, S.L. She has a MSc in Statistical Techniques from the University of Santiago de Compostela (Spain) and a BSc in Statistics from Gazi University (Turkey).
In previous posts about longitudinal data by Urko Agirre and Hèctor Perpiñán, mentions have already been made to the Mixed Models methodology .
Usually in medical and biological studies, the designed experiments include repeated measures which are used to investigate changes during a period of time repeatedly for each subject in the study. Multiple measurements are obtained for each individual at different times. These type of longitudinal data can be analyzed with a mixed effects model (Pinheiro and Bates, 2000) which allows modeling and analysis of between and within individual variation. An example of such data can be growth curves measured over time.
In many practical situations, using traditional parametric regression techniques is not appropriate to model such curves. Misleading conclusions may be reached if the time effect is incorrectly specified.
Durban et al, (2005) presented flexible mixed models for fitting subject specific curves and factor by curve interactions for longitudinal data in which the individual and interaction curves are modeled as penalized-splines (P-splines) and model’s estimation is based on the mixed model representation of P-splines (Currie and Durban, 2002).
This representation is quite interesting because it allows us to use the methodology and the several software available for mixed models (e.g., nlme and lme4 packages in R), and it also comes equipped with an automatic smoothing parameter choice that corresponds to maximum likelihood (ML) and/or restricted maximum likelihood (REML) estimation of variance components. With this representation, the smoothing parameter becomes the ratio between the variance of residuals and the variance of the random effects.
First of all, let`s define a linear mixed model for a longitudinal data.
As a matrix notation;
, assumed to be , independently of each other and of the is distributed as where and are positive definitive covariance matrices.
Consider a flexible regression model,
Where is the response variable of the observations and is the smooth function of covariate . We represent this function with a linear combination of known basis function
And we reformulate it as;
Depending on the basis that is used for the P-splines, and have the following forms;
Where is the matrix that contains the eigenvectors of the singular value decomposition of the penalty matrix and is a diagonal matrix containing the eigenvalues with null eigen values (Lee, 2010).
Therefore the model becomes;
and
Where is the number of columns of basis , and the smoothing parameter becomes; (Durban et al, 2005).
With this mixed model representation of P-splines, we can obtain more flexible mixed models which allows us to have a simple implementation of otherwise complicated models. In future posts, we can talk more about these flexible models with individual curves and factor by curve interactions also described in Durban et al (2005).
Performing statistical analyses by hand in the era of new technologies would seem crazy. Nowadays, there are three main statistical programs for doing statistics: IBM SPSS, SAS and R, as it can be read in a more extensive post of this site. Sometimes, biostatisticians need to use more than one package to carry out their analyses. This means that users of these programs have to move from one to another environment, from front-end to back-end, using different wizard and graphical interfaces, wasting in most occasions an important amount of time. Because of that, in this post I would like to address the differences between the aforementioned programs, pointing out the advantages and the most suitable usage of each software from my personal point of view.
For a good choice of a statistical program, one should take into account several issues such as the following:
1. Habit: It refers to how the biostatistician has gotten used to one particular statistical program and does not need to learn something new.
2. Distrust of freeware: it refers to the license type of the package.
3. Availability of statistical procedures: are the most used statistical methods included in the package?
4. Management of huge data sets: how big data sets are handled.
Focusing on the first point, it is remarkable that programming with these three packages is quite easy. All of them provide a choice to write the syntax code and run it whenever you want. This also can be applied to anyone who might have to send you the code, including managers or colleagues. Besides, IBM SPSS and R offer a friendly interface allowing the alternative of not learning the syntax procedure. These could be the main reasons for which people continue doing data analysis for a long period with the same statistical program, making a better researcher life!
Another important issue is whether the distribution is freeware or not. When starting to work as a biostatistician, one soon notices the high cost of statistical software licence fees. As you might know, IBM SPSS and SAS are commercial distributions and R is freeware. Although IBM SPSS and SAS are quite expensive, one should think about the benefits that both programs offer in the long term. For example, some researchers do not trust R, and think its use does not ensure getting the correct results. In those cases, trust comes at a price. Regadless of what they think, R has grown rapidly in the past few years, offering competition to other softwares.
The availability of the most used statistical procedures is another important point to be taken into account. All of them have integrated in their environments the most popular statistical methods (regression analysis, correspondence analysis, etc..) but sometimes, there is a need for implementing more sophisticated procedures. As IBM SPSS and SAS are paid-for distributions, they do not let the user to program it, they have fixed instructions. On the contrary, R could be understood as a combined and customized software of multiple small packages to solve the problems. Whenever you have a problem, it is most likely that you will find a solution for it in the packages repository.
Finally, as for the management of huge data sets, it is noteworthy that SAS is the winner. SAS has a robust platform for handling big data sets and it does not need the whole computer memory for the analysis. In this sense, R still has a long way to go even though the performance of R in this area has recently been exponentially increased. I would also say that IBM SPSS does not have enough support.
To sum up, I would like to give a couple of advices. From the point of view of my experience, I would recommend to use IBM SPSS for medical researchers who start doing statistics: it is quite easy and intuitive and there is no need to learn the syntax. But not for a long period of time: learning basic R coding is not difficult and will be sufficient for the requirements of this type of researchers. It is only a small effort but really worthy! Nonetheless, for advanced researchers and biostatisticians, a combination between SAS and R would be perfect – that is what I do!, I would say I am RSAS user… -. Now, you have the choice: IBM SPSS, SAS or…R?
While researching on scale-free networks, I found this book, which happens to include the very interesting article The structure of scientific collaboration networks and that will serve me as a follow-up to my previous post on social networks here.
Collaborative efforts lie in the foundations of the daily work of biostatisticians. As such, the analysis of these relationships –lack of interaction in some cases- appears to me as fascinating.
The article itself deals with the wider community of scientists, and connections are understood in terms of papers´ co-authorships. The study seems to prove the high presence of small world networks in the scientific community. However short the distance between pairs of scientists I wonder, though, how hard it is to cover that path, i.e., are we really willing to interact with colleagues outside our environment? Is the fear to step out of our comfort zone stopping us from pursuing new biostatistical challenges? Interestingly, one of Newman´s findings amongst researchers in the areas of physics, computer science, biology and medicine is that “two scientists are much more likely to have collaborated if they have a third common collaborator than are two scientists chosen at random from the community.”
Interaction patterns analyzed through social networks diagrams like the one shown in Fig 1., can give us a hint on these patterns of collaboration, but can also be a means towards understanding the spread of information and research in the area (ironically, in a similar fashion to the spread of diseases as explained here).
Fig.1. Biostatistics sociogram (illustration purposes only; R code adapted from here and here)
In my previous post on the topic, I focused on the great Linkedin inmaps. I will be looking this time at Twitter and an example of the huge amount of information and the great opportunities for analysis that the platform provides. R with its package twitteR makes it even easier… After adapting the code from a really useful post (see here), I obtained data relating to twitter users and the number of times they used certain hashtags (see plots in Fig. 2).
Fig.2. Frequency counts for #bio (top left), #statistics (top right), #biostatistics (bottom left), and #epidemiology (bottom right). Twitter account accessed on the 17^{th} of May 2013.
Although not an exhaustive analysis, it is interesting to notice the lower figures for #biostatistics (turquoise) and #statistics (pink), compared to #bio (green) and #epidemiology (blue) for example (please notice the different scales in the y axis for the four plots). It makes me wonder if the activity in the field is not our strongest point and whether it would be a fantastic way to promote our profession. I am certainly convinced of the great benefits a higher presence in the media would have, particularly in making it more attractive for the younger generations.
That was just a little peek of even more exciting analysis to come up in future posts, meanwhile see you on the media!
Do you make any use of the social networks in your work? Any interesting findings? Can´t wait to hear them all!
Luis Carlos Silva Ayçaguer. Senior researcher of the Escuela Nacional de Salud Pública from La Habana, Cuba; member of the development team of Epidat. Degree in Mathematics from Universidad de La Habana (1976), PhD from Universidad Carolina (Praga, 1982), Doctor of Science from Universidad de Ciencias Médicas (La Habana, 1999), Titular Academic from República de Cuba.
Email: lcsilva@infomed.sld.cu
Soly Santiago Pérez. Technical statistician at Dirección Xeral de Innovación e Xestión da Saúde Pública (General Directorate of Public Health, Spain) from 1996 to present; member of the development team of Epidat. Degree in Mathematics from Universidad de Santiago de Compostela (Spain, 1994). Specialist in Statistics and Operational Research.
Email: soly.santiago.perez@sergas.es
In this post, we present a user friendly tool for applying bayesian methods in simple situations. This tool is part of a free distribution software package, Epidat, that is being developed by the Dirección Xeral de Innovación e Xestión da Saúde Pública (Xunta de Galicia, Spain) since the early 90’s. The general purpose of Epidat is to provide an alternative to other statistical packages for performing analysis of data; more specifically, it brings together a broad range of statistical and epidemiological techniques under a common interface. At present, the fourth version of Epidat, developed in Java, is freely available from the web page: http://dxsp.sergas.es; to download the program, registration is required.
As stated above, one of the methods or “modules” included in Epidat 4 is Bayesian analysis, a tool for the application of bayesian techniques to basic problems, like the estimation and comparison of means and proportions. The module provides a simple approach to Bayesian methods, not based on hierarchical models that go beyond the scope of Epidat.
The module of Bayesian analysis is organized into several sub-modules with the following scheme:
The first option of the module can be used to apply Bayes’ theorem. The following three options (Odds ratio, Proportion and Mean) are designed to solve basic inferential problems under Bayesian logic: estimation of odds ratios, proportions and means, as well as differences or ratios of these two last parameters. The punctual estimation is accompanied by the corresponding credibility interval. The techniques available in these options also include methods related to hypothesis testing. Finally, the last sub-module allows the evaluation of conventional tests from a Bayesian perspective.
Some specific options of Bayesian analysis require the use of simulation techniques. Nevertheless, the user does not need to understand the simulation process to use the program and interpret the results correctly. In addition, the module has a user-friendly interface that allows the user to plot the a priori distribution and choose the values of its parameters (see figure above). The output of Bayesian analysis includes both numerical and graphical results, and the graphics can be edited and modified by the user. In addition, the contents of the output window can be saved as a file in several formats: *.epi, *.pdf, *.odf, or *.rtf.
Finally, like all modules of Epidat, Bayesian analysis has a useful help feature which can be accessed on the help menu in pdf format. This facility has been developed with a didactic and critical approach, including statistical basics of the methods, bibliographic references, and examples of the different options.
The last post published in FreshBiostats, by Hèctor Perpiñán, was about reports combining the use of R and Latex. In it, he explained how to transport tables and figures from the R output to Latex.
Well then, this week I shall continue talking about another tool to write reports when we do the statistical analysis with R. It allows us to reduce the time spent to draft the reports and be more accurate by minimising human copying errors. I am going to focus my attention on “R2wd” package to write MS-Word documents from R.
Although most statisticians are used to work daily with Latex which results in high quality appearance, in most cases our clients have never heard about it and prefer to work with MS-Word documents, which are easier to handle.
The R2wd package needs either the statconnDCOM server (via the rcom package) or the RDCOMClient to communicate with MS-Word via the COM interface. Once this interface is installed (on Windows OS), we can create a document using all functions available in this package: wdGet, wdTitle, wdSection, wdType, wdSetFont, wdTable (the object must be a data frame or an array), wdPlot, wdApplyTheme, etc.
Not only do these functions allow us to introduce tables or figures from R output, but can also be used to inject text elements into Word report, themes and templates, different type of text (normal, italic, Verbatim,…), etc.
The following code shows a little example on how to use this package:
</pre> ### An easy example to use "R2wd" package ### library("R2wd") #of course, you must install it before! #require("rcom") wdGet() #To open a new Word document #It is possible that we get an error message with wdGet() function. #In this case we will need to install: #installstatconnDCOM() wdTitle("We introduce here our file title") wdSection("First section begins here") wdBody("Inserts text in 'Body' style at the current cursor point in Word.") wdWrite("The function wdBody is similar to 'wdWrite'",paragraph=TRUE) wdType("Now we use an italic type and a centering text",italic=TRUE,alignment="center") wdNormal("To return a Normal type") wdWrite("We can also insert footnotes") wdInsertFootnote("insert a footnote") #you can insert footnotes in the #word document at the cursor position. wdSection("Now we include another section") wdSubsection("A toy example") wdBody("Insert tables of results") age <- c(23,12,45,34,18,41) gender <- c(1,1,0,1,0,0) height <- c(172, 150,169,180,188,160) data <- data.frame(age,gender,height) wdTable(data,autoformat=2)#three possibilities to autoformat parameter wdItemize() wdWrite("insert a text in the bullet",paragraph=TRUE) wdItemize(Template=3) wdWrite("we finish the description",paragraph=TRUE) wdBody("To finish this easy example we include a plot in our Word document") wdPlot(data$age,data$height,col="red",pch=20,height = 10, width =10, pointsize = 25) wdSave() wdQuit() #To close the session
As you have seen, besides being able to format our report as we want (although it may seem cumbersome to do from R), R2wd is really useful because with an only function – wdTable() – and one “click” I can copy a large table in my report with no mistakes.
You will find more functions here. Dare to try it! You will save plenty of time!
In another post I have already talked about different topics not directly associated to Statistic. At this time I will write about reports combining the use of R and Latex. Specifically about how to export tables to Latex and insert latex typesetting in R figures.
Obviously our technical work is very important but the final presentation which should be taken care of. Sometimes the problem is to move our results from the statistical software to the text editor so I will talk in this post about my favourite tools to perform this task.
When looking for a text editor to do a report, there are may options to choose between (MS Word, Latex, knitr/Sweave, Markdown, etc.). I pick out Latex versus MS Word given its possibility to easily include math language. The connection between knitr/Sweave and Markdown with R is straightforward (these two software imbibe R code), in future posts I will talk about them.
R tables to Latex
As I commented I first work with R and then move the results to Latex. One of the simplest and most effective tools is the function xtable() that we can find in the package of the same name, xtable.
Other important packages in R with extra functionalities to convert object from R to Latex code are:
Figures
Many people will know the graphics of Latex, called TikZ (see page examples), but what is not so known is that this type of graphics allow a wonderful connection between R and Latex. It allows that formulas, numbers, etc. displayed in the format of choice in Latex. To do this we simply have to follow the next steps:
As I mentioned in previous posts, I often have to work with Next Generation Sequencing data. This implies dealing with several variables that are text data or sequences of characters that might also contain spaces or numbers, e.g. gene names, functional categories or amino acid change annotations. This type of data is called string in programming language.
Finding matches is one of the most common tasks involving strings. In doing so, it is sometimes necessary to format or recode this kind of variables, as well as search for patterns.
Some R functions I have found quite useful when handling this data include the following ones:
As you can see by the arguments of these functions, it might be useful when manipulating strings, to get comfortable handling regular expressions. More information on regular expressions to build up patterns can be found at this tutorial and in the regex R Documentation.
Some other useful functions are included in the stringr package. As the title of the package says: “Make it easier to work with strings”:
Once again a Hadley Wickham package, along with reshape and plyr. The three of them containing a set of helpful features for handling data frames and lists.
This is just a brief summary of some options availabe in R. Any other tips on string handling?
Moisés Gómez Mateu is a PhD student at the Universitat Politècnica de Catalunya (UPC) where he works as research assistant.
1. Why do you like Biostatistics?
I have been very curious since I was a child; that’s why I like statistics. Moreover, statistics related to biology and medicine allows helping people to improve their quality of life.
2. Could you give us some insight in your current field of research?
I focus my thesis on survival analysis, especially on the issue of composite endpoints in clinical trials. The main aim is to analyze what is the best primary endpoint to use, extend the statistical theory, and make practical tools available to researchers by means of a library in R, an on-line platform, etc.
3. Did you find it difficult to move from the private sector to the University?
No. In fact, I left my job as a consultant in a marketing research company to study the MSc Statistics at the UPC, and I think it was a very good decision.
4. Which are, in your opinion, the main advantages of being a researcher?
It is very satisfactory. The results you get or the research you conduct have nothing to do with the private sector. One usually investigates issues that are related to thing you like and to help improving science in general, not only to earn money.
5. What do you think of the situation of young biostatisticians in Spain?
The reality is that several colleagues and friends are working-studying abroad or looking for opportunities …
6. What would be the 3 main characteristics or skills you would use to describe a good biostatistician?
Curiosity, Analytical skills and Creativity.
7. Which do you think are the main qualities of a good mentor?
Expertise, Modesty and Open-mindedness.
Selected publications:
Living creatures tend to organize their lives within structured communities such as families, schools, hospitals, towns or countries. For instance, students of the same age living in a town could be grouped into different classes according to their grade level, family income, school district and other features of interest. Other examples related with health care workers and patients show clear hierarchical data structures as well.
Hierarchical or nested structures (usually known as HLM) are very common throughout many research areas. The starting point of this data pattern was set up in the field of social sciences. Most research studies in this area were focused on educational data, where the main interest was to examine the relationship between inputs such as students, teachers or school resources, and student outcomes (academic achievement, self-concept, career aspirations…). Under this scenario, researchers emphasized that individuals who are drawn from an institution (classroom, school, town, …) will be more homogenous than subjects randomly sampled from a larger population: students belonging to the same classroom have the experience of being part of the same environment (places, district, teachers,…) and experiences. Due to this fact, observations based on these individuals are not fully independent.
As noted, hierarchical models consider that exist dependency among observations within the same study unit. Till last decades, owing to the lack of software development, ordinary least squares regression (OLSR), classical regression, has been used to estimate the aforementioned relationships. On consequence, results obtained from OLSR show too small standard errors leading to a higher probability of rejection of a null hypothesis than if: (1) an appropriate statistical analysis was performed; (2) data included truly independent observations. It is clear that the main issue that researchers must address is the non-independence of the observations.
Hierarchical modeling is similar to OLSR. It can be seen as an extension of the classical regression, where at least two levels are defined in the predictive model. On the base level (also called the individual level, referred as level1), the analysis is similar to OLSR: an outcome variable is defined as function of a linear combination of one or more independent level 1 variables:
where is the value of the outcome variable of the th individual of group , represents the intercept of group , is the slope of variable of group . On subsequent levels, the level 1 slope and intercept become dependent variables being predicted from level 2 variables:
Though this process, it is possible to model the effects of level 1 variables and level 2 variables on the desired outcome. In the figure of this post, one can observe that there are three main levels: patients (level1) belong to hospitals (level 2) where, at the same time, hospitals are located in certain neighborhoods (level 3).
This kind of modeling is essential to account for individual – and group level variation in estimating group-level regression coefficients. However, in certain cases, the classical and HLM approaches coincide: (1) when there is very little group-level variation and (2) when the number of groups is small and consequently, there is not enough information to accurately estimate group-level variation. In this setting, HLM gain little from classical OLRS.
Now, it is your opportunity. You know where is worth the effort of applying the HLM methods instead of classical regression.