Featured

Increasing your research visibility -through a personal website with GitHub-

Letting the world know about your work is important. This is true for any field of research, but for those of us working in any kind of computing-related field, it becomes even more relevant.

Professional networks, distribution lists, blogs and forums are an excellent way to keep up to date with research in your field (you can check a previous post on this topic here). Even social networks can be a great way to do so, depending on how you decide to use them. For instance, you can find quite a number of PostDoc positions posted on Twitter.

There is another important aspect to consider as well. By being an active part of the community, not only do you gain visibility, but you also get the chance to contribute in return… How many issues have you solved through StackOverflow, Biostars, R-bloggers posts and so on? How many articles were you able to get through ResearchGate?

Back to the matter at hand, I have just recently found out how to create a personal website with GitHub. If, like me, you do not know much about HTML, you might want to use the automatic page generator. I can tell you: if you have already figured out the content you want on it, and provided you already have a GitHub account (see a previous post on how to create an account here), it will take you less than 5 minutes. Starting from a simple static website (simple is beautiful), it can evolve into a more sophisticated site.


More on handling data frames in R: dplyr package

I am currently taking a course on edX, which is becoming one of my favorite online platforms, along with Coursera. This time it is “Data Analysis for Genomics”, conducted by Rafael Irizarry; you might know him from his indispensable Simply Statistics blog. The course has just started, but so far the feeling is really good. It has been through this course that I have found out about a new package by the great Hadley Wickham: dplyr. It is meant to work exclusively with data frames and provides some improvements over his previous plyr package.

As I spend quite a bit of time working with this kind of data, and having written a post some time ago about how to handle multiple data frames with the plyr package, I find it fair to post an update on this one. Besides, I think it is extremely useful, and I have already incorporated some of its functions into my daily routine.

This package is meant to work with data frames rather than with vectors, unlike many of the base functions. Its functionality might replace the ddply function in the plyr package (one thing to mention is that, as far as I know, it is not possible to work with lists yet). Four functions are, in my opinion, reason enough to move to this package: filter( ) instead of subset, arrange( ) instead of sorting with order, select( ) as an equivalent to the select argument of the subset function, and mutate( ) instead of transform. You can check here some examples on using these functions. The syntax is clearly improved and the code gets much neater, no doubt about it.

Two other essential functions are group_by( ), which allows you to group the data frame by one or several variables, and summarise( ), for calculating summary statistics on grouped data.
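To make the functions mentioned above a bit more concrete, here is a minimal sketch. The `variants` data frame, its column names and all its values are made up for illustration, and the calls are written against a recent dplyr release, so details may differ in the version available when this post was written.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
variants <- data.frame(
  gene  = c("BRCA1", "BRCA1", "TP53", "TP53", "SPG7"),
  depth = c(35, 12, 50, 8, 27),
  qual  = c(99, 40, 87, 15, 60),
  stringsAsFactors = FALSE
)

# filter(): keep only the rows that meet a condition
well_covered <- filter(variants, depth >= 10)

# arrange(): sort the rows, here by decreasing depth
sorted <- arrange(well_covered, desc(depth))

# select() keeps a subset of columns; mutate() adds a derived column
scored <- mutate(select(sorted, gene, depth, qual),
                 qual_per_read = qual / depth)

# group_by() + summarise(): one row of summary statistics per gene
per_gene <- summarise(group_by(scored, gene),
                      n_variants = n(),
                      mean_depth = mean(depth))
print(per_gene)
```

Each function takes a data frame as its first argument and returns a data frame, which is what makes the calls easy to chain.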

You can find general information about the package at the rstudio blog or several other blogs talking about its goodness, here or here.

Not to mention its other great advantage, speed, which is not a minor issue for those of us who work with extremely large data frames. Several speed tests have been performed (here and here), and it seems to clearly outperform the plyr and data.table packages***.

 

I am so glad to have found it… I hope you will be too!

 

***Additional speed test results will be published soon, since this statement might be wrong.


Analysis of PubMed search results using R

Looking for information about meta-analysis in R (subject for an upcoming post, as it has become a popular practice to analyze data from different Genome Wide Association studies), I came across this tutorial from The R User Conference 2013. I couldn't make it this time, even though it was held so close; maybe Los Angeles next year…

Back to the topic at hand, that is how I found out about the RISmed package, which is meant to retrieve information from PubMed. It looked really interesting because, as you may imagine, this is one of the most used resources in my daily routine.

Its use is quite straightforward. First, you define the query and download data from the database (be careful about your IP being blocked from accessing NCBI in the case of large jobs!). Then, you might use the information to look for trends on a topic of interest, extract specific information from abstracts, get descriptive statistics…

In order to try it out, I decided to get data on what has been published about Next Generation Sequencing. To do so, I adopted the search terms proposed in the paper by Jia et al. Through the following code we can get the PubMed results for these search terms since 1980:

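library(RISmed)
# Search terms adopted from Jia et al.
query <- "(exome OR whole OR deep OR high-throughput OR (next AND generation) OR (massively AND parallel)) AND sequencing"
# Run the search on PubMed, restricted to 1980-2013 and to at most 30000 records
ngs_search <- EUtilsSummary(query, type = "esearch", db = "pubmed", mindate = 1980, maxdate = 2013, retmax = 30000)
# Number of records matching the query
QueryCount(ngs_search)
# Download the matching records (mind the NCBI usage limits for large jobs)
ngs_records <- EUtilsGet(ngs_search)
# Publication year of each record, tabulated into counts per year
years <- Year(ngs_records)
ngs_pubs_count <- as.data.frame(table(years))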

This code allows us to get the number of papers published on this topic per year. By also getting data about the total number of publications per year, we are able to normalize the data. The complete R code, once the data are downloaded and edited, can be found at FreshBiostats GitHub Gist. In the next graph, we can see the publication trend for Next Generation Sequencing per year:

[Figure: NGS publications per year]
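As a rough idea of the normalization step mentioned above, here is a minimal sketch in base R. It is not the code from the Gist: the counts are made up for illustration (they are not real PubMed figures), and `total_pubs_count` and `per_100k` are hypothetical names.

```r
# Hypothetical counts, invented for illustration; not actual PubMed results
ngs_pubs_count <- data.frame(years = c("2010", "2011", "2012"),
                             Freq  = c(120, 300, 650))
total_pubs_count <- data.frame(years = c("2010", "2011", "2012"),
                               Total = c(800000, 850000, 900000))

# Join the topic counts with the yearly totals
merged <- merge(ngs_pubs_count, total_pubs_count, by = "years")

# Normalize: NGS papers per 100,000 publications in the same year
merged$per_100k <- merged$Freq / merged$Total * 1e5
print(merged)
```

Expressing the counts relative to the yearly total helps to separate a real increase in interest from the general growth in the number of indexed papers.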

I was also curious about which ones would be the journals with the highest number of publications on this topic. Using the following code we can get the count of NGS publications per journal:

# Abbreviated journal title of each record
journal <- MedlineTA(ngs_records)
ngs_journal_count <- as.data.frame(table(journal))
# Sort by decreasing count and keep the 25 journals with the most publications
ngs_journal_count_top25 <- ngs_journal_count[order(-ngs_journal_count[,2]),][1:25,]

Again, the complete code that allows us to normalize the data by the total number of publications per journal, as well as the following barplots showing the result, is available at our Gist:

[Figure: NGS publications per journal, total counts]

[Figure: NGS publications per journal, normalized]

You can find some other examples using this package at Dave Tang's Bioinformatics blog. Additionally, some alternatives to the RISmed package can be found at the R Chronicle and R Psychologist blogs.

Other potential applications of this package include creating a co-author network, as described in Matthew Maenner's blog.

Search and analyze carefully!


Keeping up to date with research in your field (Part II)

  • Online free courses/tutorials: there is plenty of material online, which sometimes makes it difficult to filter what is really worthwhile. Here again, tips from blogs or colleagues from your network might serve as a reference. Coursera is, in my opinion, one of the best platforms, due to the quality and versatility of its courses. There are several excellent courses related to Statistics and Data Analysis. Some of them are more general and deal with R programming (e.g. Data Analysis, Computing for Data Analysis, both using R), but there are also more specific ones (e.g. Design and Interpretation of Clinical Trials, Statistical Analysis of fMRI Data…). You can check the full list here.

I would like to mention here some other resources available for those with a math/statistics background who might be interested in getting some insight into genetics. As we mentioned previously in other posts, it is critical to understand the data you are dealing with and these websites will help you with that:

An extensive list of additional Online Genetics Education Resources can be found at the NHGRI site

For those wanting to get an introduction to NGS, there is a Next Generation Sequencing Practical Course at EMBL-EBI Train online. A more advanced tutorial, showing the use of R/Bioconductor packages for High-Throughput Sequence Analysis, can be found here.

There are, obviously, countless courses and tutorials about R and specific packages. Besides, GitHub is becoming more and more popular. By creating a Gist on GitHub you can share your code quickly and easily; see a quick example here.

  • Webinars: many commercial sites offer highly focused free webinars that might be of interest. For instance, both Science and Nature host webcasts regularly.

  • Forums/discussion lists: when you are stuck with something and you are not able to find a solution, specialized forums might come to the rescue. Either because the same question has been asked before, or because there is someone willing to help, you will most likely get your question answered. Two forums are particularly useful in my field: BioStar and SEQanswers. As for R programming, R-help from the R Mailing Lists and Stack Overflow are two of the sites where you can find most of your questions answered. Our life without them would be much more difficult for sure…

As I mentioned at the beginning of the previous post, it is sometimes difficult to find a balance between the time you spend learning and your more “productive” time. Besides, for those of us whose work is also a passion, the line between work and personal interests quite often becomes blurred. And so we will spend much of our leisure time diving into new stuff that will eventually be useful in our work. Some might argue that the time spent in training or the amount of information you have access to might be overwhelming. Is it worth the effort? How much time should we invest in learning? Are we able to take advantage of what we learn? You can take a look at this video for more elaborate thoughts on the subject.

I hope the information contained in these posts might be useful… Your suggestions on additional resources will be more than welcome!


Keeping up to date with research in your field (Part I)

No doubt about it, we must keep up with news and advances in our area of expertise. In this series of two posts I just want to introduce the ways I find useful in order to achieve this goal. Staying up-to-date means not only knowing what is being done in your field but also learning new skills, tools or tricks that might be applied. I will save for last some thoughts about getting a proper work-learning balance and potential impact on productivity.

  • Blogs. It might be an obvious one, but it is for sure one of my main sources of information. Several blogs I follow include: Getting Genetics Done, Genomes Unzipped, Our 2 SNPs, Wellcome Trust, R-Bloggers, Simply Statistics and many others mainly focused on biostatistics that you can find in our blog roll. Most of them are accessible through RSS feeds, if not through mail subscription.
  • Twitter. Most blogs also have a Twitter account where you can follow their updates (so it might be an alternative). You can follow Twitter accounts from networks of interest, companies or people working in your field too. For some ideas on whom to follow, go to our Twitter!
  • PubMed / Journal alerting services. A keyword-specific PubMed search can be just as relevant. Available both through email and RSS feeds, these alerts send you updates containing your search terms (for instance “Next Generation Sequencing”, “rare variant association analysis”, “Spastic Paraplegia”…). You can also get information about an author's work or the citations of a given paper. You can find here how to do it. An alternative is to set up alerts for the Table of Contents of your journals of interest, which inform you of the topics of the latest papers (Nature Genetics, Bioinformatics, Genome Research, Human Mutation, Biostatistics…). Accessing RSS feeds through your mail app (Mozilla Thunderbird in my case) is straightforward.
  • Professional networking sites. Obviously, when it is all about networking, having a good network of colleagues is one of the best ways to keep up with job offers, news or links to resources. For instance, through my LinkedIn contacts I receive quite a bunch of useful tips. Well-selected LinkedIn groups are also a source of very valuable information and news, as are companies in your area of work (pharma industry, genomic services, biostatistics/bioinformatics consulting). LinkedIn is a more general site, but there are other professional sites focused on research: ResearchGate and Mendeley. Mendeley in particular, apart from being a networking site, is an excellent reference manager. Mendeley and MyNCBI are the two main tools I use to keep my bibliography and searches organized.
  • Distribution lists. Apart from general distribution lists, including those from one's institution or funding agency, more specific newsletters or bulletins from networks such as Biostatnet, or from the scientific societies you belong to, are a good source of news, events and so on. There are even more restricted ones (for instance, an R users list has recently been created in my institution).

To be continued next week …..


Sharing statistical analysis and results through web applications

I have to admit I am not completely comfortable with the RStudio IDE yet. However, RStudio projects are of great interest, and the new package Shiny (released last November) is no exception.

Shiny allows you to build web applications in R, so anyone can access your analysis results interactively through a web browser. As they state on their website: “…your users choose input parameters using friendly controls like sliders, drop-downs, and text fields. Easily incorporate any number of outputs like plots, tables, and summaries..”

The application is structured in two components: a user interface script named ui.R and a server script named server.R. The first one deals with the input and output format; the second one contains the code to run the analysis.
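The split between the two components can be sketched as follows. This is not the SNPassoc app described in this post: it is a toy example with simulated genotypes, written in a single script for brevity, and it assumes a recent Shiny release that provides `fluidPage()` and `shinyApp()` (the syntax of the version available when this post was written may differ).

```r
library(shiny)

# What would live in ui.R: input controls and placeholders for the outputs
ui <- fluidPage(
  titlePanel("Genotype frequencies (toy data)"),
  sliderInput("n", "Number of simulated individuals:",
              min = 10, max = 500, value = 100),
  plotOutput("freq_plot")
)

# What would live in server.R: the code that runs the analysis
server <- function(input, output) {
  output$freq_plot <- renderPlot({
    # Simulated genotypes; the probabilities are arbitrary
    genotypes <- sample(c("AA", "AG", "GG"), size = input$n,
                        replace = TRUE, prob = c(0.49, 0.42, 0.09))
    barplot(table(genotypes), ylab = "Count")
  })
}

# Bundle both parts into an app object
app <- shinyApp(ui = ui, server = server)

# To launch it locally in an interactive session:
# runApp(app)
```

Moving the slider changes `input$n`, which makes Shiny re-run the `renderPlot()` block and redraw the plot without any extra code.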

I made a quick trial following the tutorial and my (very simple) web app was ready in no time. It is based on some of the examples of the SNPassoc vignette, an R package designed to perform genetic association studies. In this application, you can check the summary for a small set of SNPs and get both a plot showing the frequencies and the results of the association study for a given SNP.

By using Shiny, you can run your applications locally or share them with other users so they can run the applications on their own computers. There are several options to distribute your apps; you can check all of them here. In this case, the app can be accessed through a GitHub Gist. Once the Shiny package is installed (along with the SNPassoc package used in this example), you just have to run the following code:

library(shiny)
shiny::runGist('4e5e618431a59abe692b')

In my opinion, this tool has great potential for sharing analysis results through an interactive and friendly interface. It might replace files containing a lot of graphics and tables, while saving time for both the data analyst and the end user. Have you tried it? Do you want to share your apps with us?

Dealing with strings in R

As I mentioned in previous posts, I often have to work with Next Generation Sequencing data. This implies dealing with several variables that are text data or sequences of characters that might also contain spaces or numbers, e.g. gene names, functional categories or amino acid change annotations. In programming, this type of data is called a string.

Finding matches is one of the most common tasks involving strings. In doing so, it is sometimes necessary to format or recode this kind of variable, as well as to search for patterns.

Some R functions I have found quite useful when handling this data include the following ones:

  • colsplit( ) in the reshape package. It allows you to split up a column based on a regular expression
  • grepl( ) for subsetting based on string values that match a given pattern. Here again we use regular expressions to describe the pattern

As you can see from the arguments of these functions, it is useful to get comfortable with regular expressions when manipulating strings. More information on how to build patterns with regular expressions can be found in this tutorial and in the regex R Documentation.
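A minimal sketch of both tasks, using base R only: the annotation strings are made up for illustration, and the column split is done with `strsplit()` as a stand-in for colsplit( ), not with the reshape function itself.

```r
# Hypothetical annotation strings, invented for illustration
annotations <- data.frame(
  change = c("BRCA1:p.Gly12Asp", "TP53:p.Arg175His", "SPG7:c.1529C>T"),
  stringsAsFactors = FALSE
)

# grepl() returns TRUE/FALSE for each element, so it can be used to subset rows.
# In the regular expression, "\\." matches a literal dot.
protein_changes <- annotations[grepl(":p\\.", annotations$change), , drop = FALSE]

# Base R stand-in for a column split: break each string at the ":" separator
parts <- do.call(rbind, strsplit(annotations$change, ":", fixed = TRUE))
annotations$gene <- parts[, 1]
annotations$hgvs <- parts[, 2]

print(protein_changes)
print(annotations)
```

The dot has to be escaped because, unescaped, it would match any character.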

Some other useful functions are included in the stringr package. As the title of the package says: “Make it easier to work with strings”:

  • str_detect( ) detects the presence or absence of a pattern in a string. It is based on the grepl function listed above
  • fixed( ) makes a function look for matches based on fixed characters, instead of regular expressions
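The difference between a regular expression and a fixed pattern can be sketched with these two functions. The identifiers below are made up for illustration.

```r
library(stringr)

# Hypothetical identifiers, invented for illustration
ids <- c("chr1.100", "chr1_200", "chrX.300")

# As a regular expression, "." matches any character, so every id matches
regex_hits <- str_detect(ids, ".")

# Wrapped in fixed(), "." is taken literally, so only ids containing a dot match
literal_hits <- str_detect(ids, fixed("."))

print(regex_hits)
print(literal_hits)
```

Using fixed( ) avoids having to escape special characters when the pattern is meant literally.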

Once again, this is a Hadley Wickham package, like reshape and plyr. The three of them contain a set of helpful features for handling data frames and lists.

This is just a brief summary of some of the options available in R. Any other tips on string handling?