
More on handling data frames in R: dplyr package

I am currently taking a course on edX, which is becoming one of my favorite online platforms, along with Coursera. This time it is “Data Analysis for Genomics”, taught by Rafael Irizarry, whom you might know from his indispensable Simply Statistics blog. The course has just started, but so far the feeling is really good. It is through this course that I found out about a new package by the great Hadley Wickham: dplyr. It is meant to work exclusively with data frames and provides some improvements over his previous plyr package.

As I spend quite a bit of time working with this kind of data, and wrote a post some time ago about how to handle multiple data frames with the plyr package, I find it fair to post an update on this one. Besides, I think it is extremely useful, and I have already incorporated some of its functions into my daily routine.

This package is meant to work with data frames rather than with vectors, as many base functions do. Its functionality might replace the ddply function in the plyr package (one thing to mention is that, as far as I know, it is not possible to work with lists yet). Four functions are, in my opinion, reason enough to move to this package: filter( ) instead of subset, arrange( ) instead of sorting rows with order, select( ) as the equivalent of the select argument in subset, and mutate( ) instead of transform. You can check here some examples on using these functions. The syntax is clearly improved and the code gets much neater, no doubt about it.
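To make the comparison with base R concrete, here is a minimal sketch of the four functions on a toy data frame. The data frame genes and its columns are made up for illustration, and the base R equivalents in the comments are my own reading of the correspondence, not an official mapping.

```r
library(dplyr)

# Toy data frame; every name and value here is hypothetical
genes <- data.frame(
  gene  = c("A", "B", "C", "D"),
  chr   = c(1, 1, 2, 2),
  start = c(100, 500, 200, 50),
  end   = c(300, 900, 450, 400)
)

# filter(): keep the rows matching a condition
# (base R: subset(genes, chr == 2))
chr2 <- filter(genes, chr == 2)

# arrange(): reorder the rows by a column
# (base R: genes[order(genes$start), ])
by_start <- arrange(genes, start)

# select(): keep only some columns
# (base R: subset(genes, select = c(gene, start)))
gene_start <- select(genes, gene, start)

# mutate(): add a column computed from existing ones
# (base R: transform(genes, len = end - start))
with_len <- mutate(genes, len = end - start)
```

Each call takes the data frame as its first argument and returns a new data frame, so the original genes is left untouched.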

Two other essential functions are group_by( ), which lets you group the data frame by one or several variables, and summarise( ), which calculates summary statistics on grouped data.
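As a small sketch of how these two fit together, the example below counts rows and averages a column per group. The data frame and the column names chr and len are invented for the example.

```r
library(dplyr)

# Toy data; names and values are hypothetical
genes <- data.frame(
  chr = c(1, 1, 2, 2, 2),
  len = c(200, 400, 250, 350, 150)
)

# Group by chromosome, then compute one summary row per group
per_chr <- summarise(
  group_by(genes, chr),
  n_genes  = n(),        # number of rows in each group
  mean_len = mean(len)   # average of len within each group
)
```

The result has one row per value of chr, which is roughly what ddply(genes, "chr", summarise, ...) would give in plyr.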

You can find general information about the package at the RStudio blog, or in several other blogs praising it, here or here.

Not to mention its other great advantage: speed, which is no minor issue for those of us who work with extremely large data frames. Several speed tests have been performed (here and here), and it seems to clearly outperform the plyr and data.table packages***.

 

I am so glad to have found it… I hope you will be too!

 

***Additional speed test results will be published soon, since this statement might be wrong.

Handling multiple data frames in R

I am not (yet) that highly skilled at programming in R, so when I run into a function or package that really meets my needs, well, that is quite a big deal. And that is what happened with the plyr package.

I often have to work with a large number of data frames. It is not about “big” statistics here, just some basic descriptive statistics, subsetting and so on. It has more to do with handling this amount of data in an easy and efficient way.

There are two routines I find essential when manipulating these data sets. The first one is being able to operate on all these data frames at once (e.g. subsetting for filtering) by putting them in a list.

So let's say we have a certain number of files, file_1, file_2, …, each holding a data frame with the same variables named var1, var2, …, and we want to subset all of them based on a certain variable.

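library(plyr)
# Names of the files in the working directory that match "file_"
dataframes <- list.files(pattern = "file_")
# Read every tab-separated file into a list of data frames
list_dataframes <- llply(dataframes, read.table, header = TRUE, sep = "\t")
# Number of rows and columns of each data frame, returned as one data frame
dimensions <- ldply(list_dataframes, dim)
# Keep only the rows where var1 equals "myvalue"
filter <- llply(list_dataframes, subset, var1 == "myvalue")
# Keep only the columns var1 and var3
selection <- llply(list_dataframes, subset, select = c(var1, var3))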

No need for “for” loops here! It is certainly much neater and easier this way. More information about llply, ldply or laply can be found in the plyr R tutorial. Much has been said about its advantages in other blogs; you can check it here or in my “indispensable” gettinggeneticsdone.

The second one lets us identify common values between those data frames. The first function that comes to mind is merge. Again, there are several useful posts about it (gettinggeneticsdone, r-statistics), and it sure serves the purpose in many cases. But merge takes only two data frames at a time, and quite frequently you find yourself with several data frames to merge; merge_all and merge_recurse in the reshape package overcome this problem. There is an excellent R wiki entry covering this topic.
As an alternative to merge, join (again in the plyr package) lets you specify how duplicates should be matched.
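Here is a minimal sketch of the difference, using three invented data frames that share an id column. For the several-data-frames case it uses base R's Reduce over merge as a stand-in rather than merge_all or merge_recurse, so it should be read as an illustration of the idea and not as how reshape does it.

```r
library(plyr)

# Toy data; names and values are hypothetical. id 2 is duplicated in scores.
samples <- data.frame(id = c(1, 2, 3), group = c("a", "b", "a"))
scores  <- data.frame(id = c(1, 2, 2, 4), score = c(10, 20, 25, 40))
ages    <- data.frame(id = c(1, 2, 3), age = c(30, 41, 52))

# merge(): keeps only ids present in both, and every duplicate match
merged <- merge(samples, scores, by = "id")

# join(): type = "left" keeps every row of samples;
# match = "first" uses only the first matching row in scores
joined_first <- join(samples, scores, by = "id",
                     type = "left", match = "first")

# Several data frames at once: apply merge() pairwise along the list
all_merged <- Reduce(function(x, y) merge(x, y, by = "id"),
                     list(samples, scores, ages))
```

With match = "first", the duplicated id in scores contributes a single row, whereas merge returns one row per matching pair.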

Note that both packages, plyr and reshape, are developed by Hadley Wickham, ggplot2's creator.

These functions have become part of my daily routine and they definitely save me a lot of trouble. I have yet to explore another package I have read great things about: sqldf.

Do you have any other suggestions on manipulating data frames?