Hasta la vista, Statistics2013!

After twelve months filled with exciting events celebrating the great role of statistics in society, it is now time to say goodbye to the International Year of Statistics, Statistics2013.


Illustration: Cinta Arribas

From FreshBiostats we would like to thank the steering committee for making of this such a fantastic opportunity for professionals and students at all levels to interact and share the pride in what we do. We would like to send a special thank you to Ronald Wasserstein and Jeffrey Myers who helped us feel involved in the celebration with our small contributions to their newsletter and blog.

The good news is that it seems like there will be a new project starting in late January. We are really looking forward to The World of Statistics! – find here a whole lot of new activities already arranged for 2014-.


How to make an R package repository: an R-recipe

It is well known that R is one of the most used statistical package by the data scientists (researchers, technicians, etc..). This package allows flexibility to adapt already developed functions to their own interests. However, the problem comes when one needs to perform a complex algorithm that requires strong programming level or even to show to worldwide scientists how a new statistical technique works. When this scenario is present, it means that it is time to create your own package in R.

The most In this post we will learn how to make a simple R package only following several steps which are explained in the next paragraphs:

1. Update R version and other relevant tools.

First of all, if you are planning to develop a R package, it is highly recommended to update your R version to newest one. One must take into account that R versions are continuously updating so that your package would be likely to not to work properly. Furthermore, the package called Rtools is necessary to install in your system. This package has several add-ins (such as MinGW compiler) which are helpful for the development of the package.

2. Get to know your Operative System

For Windows System, you have to add to the PATH system variable the whole path where R is installed. For instance, if R is installed in c:, the PATH can be fixed as:



3. Decide the name and the kind of the package wanted to build.

This step is the most important. First of all, the package name can only be composed by numbers and letters and it is highly recommended to start it with a letter.

When building the functions belonging to the package, it is very important the arrangement of the files where the functions are written. There are two possible situations: (1) all functions in a file; (2) each function in its own file.  Both options are extreme since (1) writing all functions in a file could overload the system and (2) writing many functions in its own files implies having multiple files that should be packed. So, in order to search an alternative to these options, I suggest grouping related functions into the same files.

4. Edit some files

Once the previous steps are done, several files should be filled out/edited so that any R user who wants to use the functions within the package is able to understand what the function does.

First of all, it is obligatory to fill in  a file named DESCRIPTION. It is used to describe the package, with author, and license conditions in a structured text format that is readable by computers and by people.

Secondly, a tutorial of the package (called as VIGNETTES) should be written. An extended description with reproducible examples  shows the performance of each function within the repository.

5. Building, installing and executing the package

As final step, you have to mix all the components in a file by means of different software commands. Using the package.skeleton() function (see below) one can create the desired mypkg package:


where f, g, h, i¸ are all the functions belonging to the repository.

Once mypkg is created, the next step is to install it in our R system. You can use the common ways to perform it but I would advise using R commands to do that (R CMD INSTALL). Load the package and execute it. It is time to enjoy it!!

This has been a summary of the steps to follow for a R package repository development. Now it is your challenge!!!

Merry Christmas and Happy New Year!


Analysis of PubMed search results using R

Looking for information about meta-analysis in R (subject for an upcoming post as it has become a popular practice to analyze data from different Genome Wide Association studies) I came across  this tutorial from The R User Conference 2013 – I couldn´t make it this time, even when it was held so close, maybe Los Angeles next year…

Back to the topic at hand, that is how I found out about the RISmed package which is meant to retrieve information from PubMed. It looked really interesting because, as you may imagine,this is one of the most used resources in my daily routine.

Its use is quite straightforward. First, you define the query and download data from the database (be careful about your IP being blocked from accessing NCBI in the case of large jobs!) . Then, you might use the information to look for trends on a topic of interest or extracting specific information from abstracts, getting descriptives,…

In order to try it out, I decided to get data regarding what has been published relating to Next Generation Sequencing. For doing so, I adopted the search terms proposed in the paper by Jia et al. Through the following code we can get the PubMed results for these search terms since 1980:

query = "(exome OR whole OR deep OR high-throughput OR (next AND generation) OR (massively AND parallel)) AND sequencing"
ngs_search <- EUtilsSummary(query, type="esearch",db = "pubmed",mindate=1980, maxdate=2013, retmax=30000)
ngs_records <- EUtilsGet(ngs_search)
years <- Year(ngs_records)
ngs_pubs_count <- as.data.frame(table(years))

This code allow us to get published papers on this topic per year. By getting also data about the total number of publications per year, we are able to normalize the data. The complete R code, once the data are downloaded and edited can be found at  FreshBiostats GitHub Gist. In the next graph, we can see the publication trend for Next Generation Sequencing per year:


I was also curious about which ones would be the journals with the highest number of publications on this topic. Using the following code we can get the count of NGS publications per journal:

journal <- MedlineTA(ngs_records)
ngs_journal_count <- as.data.frame(table(journal))
ngs_journal_count_top25 <- ngs_journal_count[order(-ngs_journal_count[,2]),][1:25,]
Again, the complete code that allows us to normalize the data by the total number of publications per journal, as well as the following barplots showing the result, is available at our Gist:



You cand find some other examples using this package at Dave Tangs Bioinformatics blog. Additionally, some alternatives to the use of RISmed package can be found at R Chronicle and R Psychologist blogs.

Other potential applications of this package include creating a co-author network, as is described in Matthew Maenner´s blog.

Search and analyze carefully!