Handling multiple data frames in R

I am not (yet) that highly skilled at programming in R, so when I run into a function or package that really meets my needs, it is quite a big deal. That is what happened with the plyr package.

I often have to work with a large number of data frames. This is not about “big” statistics, just some basic descriptives and subsetting. It has more to do with handling this amount of data in an easy and efficient way.

There are two routines I find essential when manipulating these data sets. The first is being able to operate on all of these data frames at once (e.g. subsetting to filter them) by collecting them in a list.

So let's say we have a certain number of data frames, file_1, file_2, …, each with the same variables named var1, var2, …, and we want to subset all of them based on a certain variable.

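library(plyr)
# Names of all files in the working directory that match "file_"
dataframes <- list.files(pattern = "file_")
# Read every file into a list of data frames
list_dataframes <- llply(dataframes, read.table, header = TRUE, sep = "\t")
# One row of dimensions (rows, columns) per data frame
dimensions <- ldply(list_dataframes, dim)
# Keep only the rows where var1 equals "myvalue"
filter <- llply(list_dataframes, subset, var1 == "myvalue")
# Keep only the columns var1 and var3
selection <- llply(list_dataframes, subset, select = c(var1, var3))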

No need for “for” loops here! It is certainly much neater and easier this way. More information about llply, ldply or laply can be found in the plyr R tutorial. Much has been said about its advantages in other blogs; you can check it here or in my “indispensable” gettinggeneticsdone.
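To see what each of these functions returns without needing any files on disk, here is a minimal sketch on two toy data frames. The values are made up for illustration, and the names filtered and rows_kept are hypothetical. It uses an explicit anonymous function for the filter step, which is just one way of writing it.

```r
library(plyr)

# Two toy data frames standing in for file_1 and file_2 (made-up values)
list_dataframes <- list(
  data.frame(var1 = c("myvalue", "other"), var2 = 1:2, var3 = c(10, 20)),
  data.frame(var1 = c("myvalue", "myvalue", "other"), var2 = 3:5, var3 = c(30, 40, 50))
)

# ldply: list in, data frame out (one row of dimensions per data frame)
dimensions <- ldply(list_dataframes, dim)

# llply: list in, list out (each data frame filtered on var1)
filtered <- llply(list_dataframes, function(df) df[df$var1 == "myvalue", ])

# laply: list in, array out (here a plain vector of row counts)
rows_kept <- laply(filtered, nrow)
```

The first letter of each function name says what goes in (l for list) and the second what comes out (l for list, d for data frame, a for array).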

The second routine is identifying common values between those data frames. The first function that comes to mind is merge. Again, there are several useful posts about it (gettinggeneticsdone, r-statistics), and it certainly serves the purpose in many cases. Quite frequently, though, you find yourself with several data frames to merge, whereas merge only takes two at a time; merge_all and merge_recurse in the reshape package overcome this problem. There is an excellent R wiki entry covering this topic.
As an alternative to merge, join (again in the plyr package) lets you specify how duplicates should be matched.
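
As a rough illustration of the duplicate handling, here is a small sketch with made-up data frames (samples, results and extra are hypothetical names). It also uses join_all, a plyr function not mentioned above, as one possible way to join a whole list of data frames in a single call.

```r
library(plyr)

# Toy data: the id "a" appears twice in results (made-up values)
samples <- data.frame(id = c("a", "b", "c"), var1 = c(1, 2, 3))
results <- data.frame(id = c("a", "a", "b"), var2 = c(10, 11, 20))

# match = "all" keeps every matching row of results, so "a" appears twice
left_all <- join(samples, results, by = "id", type = "left", match = "all")

# match = "first" keeps only the first match for each row of samples
left_first <- join(samples, results, by = "id", type = "left", match = "first")

# join_all joins a list of data frames one after another on the same key
extra <- data.frame(id = c("a", "b", "d"), var3 = c(TRUE, FALSE, TRUE))
merged <- join_all(list(samples, results, extra), by = "id", type = "inner")
```

With type = "inner", only the ids present in every data frame survive, which is one way to get at the common values.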

Note that both packages, plyr and reshape, are developed by Hadley Wickham, ggplot2's creator.

These functions have become part of my daily routine and they definitely save me a lot of trouble. I have yet to explore another package I have read great things about: sqldf.

Do you have any other suggestions on manipulating data frames?

One thought on “Handling multiple data frames in R”

  1. Thanks for a great post and for introducing us to this amazing package, Pilar! It's completely new to me and I'm finding it extremely useful…
