Analysis of PubMed search results using R

While looking for information about meta-analysis in R (the subject of an upcoming post, as it has become a popular way to analyze data from different genome-wide association studies), I came across this tutorial from The R User Conference 2013. I couldn't make it this time, even though it was held so close; maybe Los Angeles next year…

Back to the topic at hand: that is how I found out about the RISmed package, which is meant to retrieve information from PubMed. It looked really interesting because, as you may imagine, PubMed is one of the most used resources in my daily routine.

Its use is quite straightforward. First, you define the query and download data from the database (be careful: NCBI may block your IP if you run large jobs!). Then, you can use the information to look for trends on a topic of interest, extract specific information from abstracts, get descriptive statistics, and so on.
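One way to act on the warning about large jobs is to split the work into smaller requests and pause between them. As far as I know, NCBI asks for no more than three requests per second when no API key is used, hence the default pause below. This is only a minimal sketch: `fetch_throttled` and `fetch_one` are hypothetical names of my own, not RISmed functions, and turning a failed request into `NA` is just one possible design choice.

```r
# Sketch: call `fetch_one` on each item, pausing between calls so that a large
# job stays under NCBI's request limit. `fetch_throttled` and `fetch_one` are
# hypothetical names, not part of RISmed.
fetch_throttled <- function(items, fetch_one, pause = 0.34) {
  results <- vector("list", length(items))
  for (i in seq_along(items)) {
    if (i > 1) Sys.sleep(pause)  # wait between consecutive requests
    # Keep going if one request fails; record NA for that item instead
    results[[i]] <- tryCatch(fetch_one(items[[i]]), error = function(e) NA)
  }
  names(results) <- as.character(items)
  results
}

# Offline check with a stand-in fetcher (no network access needed)
demo <- fetch_throttled(2011:2013, function(y) y - 2000, pause = 0)
```

With RISmed loaded and a `query` string defined, `fetch_one` could be something like `function(y) QueryCount(EUtilsSummary(query, type = "esearch", db = "pubmed", mindate = y, maxdate = y))`, which would return one count per year.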

To try it out, I decided to get data on what has been published about Next Generation Sequencing. To do so, I adopted the search terms proposed in the paper by Jia et al. With the following code we can get the PubMed results for these search terms since 1980:

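library(RISmed)
# Search terms adopted from Jia et al.
query <- "(exome OR whole OR deep OR high-throughput OR (next AND generation) OR (massively AND parallel)) AND sequencing"
# Run the search against PubMed for 1980-2013, retrieving up to 30000 records
ngs_search <- EUtilsSummary(query, type="esearch", db="pubmed", mindate=1980, maxdate=2013, retmax=30000)
QueryCount(ngs_search)  # number of records matching the query
# Download the matching records (this can take a while for large result sets)
ngs_records <- EUtilsGet(ngs_search)
# Publication year of each record, tabulated into counts per year
years <- Year(ngs_records)
ngs_pubs_count <- as.data.frame(table(years))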

This code allows us to get the number of papers published on this topic per year. By also getting the total number of publications per year, we are able to normalize the data. The complete R code, once the data are downloaded and edited, can be found at the FreshBiostats GitHub Gist. In the next graph, we can see the publication trend for Next Generation Sequencing per year:

[Figure: ngs_year]
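The normalization step mentioned above can be sketched roughly as follows. This is not the code from the Gist, only a minimal illustration. It assumes a second data frame of total PubMed publications per year with the same `years` and `Freq` columns as `ngs_pubs_count`. The function name `normalize_by_year` and the toy numbers are hypothetical and are not real PubMed counts.

```r
# Sketch: express per-year topic counts as a rate relative to all publications
# in the same year. Both inputs are data frames with columns `years` and
# `Freq`, as produced by as.data.frame(table(years)).
# `normalize_by_year` is a hypothetical helper, not part of RISmed.
normalize_by_year <- function(topic_counts, total_counts, per = 1e5) {
  # Join on year; the two Freq columns become Freq_topic and Freq_total
  merged <- merge(topic_counts, total_counts, by = "years",
                  suffixes = c("_topic", "_total"))
  merged$rate <- per * merged$Freq_topic / merged$Freq_total
  # Sort chronologically, whether `years` is a factor or a character vector
  merged[order(as.integer(as.character(merged$years))), ]
}

# Toy numbers for illustration only (not real PubMed counts)
toy_topic <- data.frame(years = c("2011", "2012"), Freq = c(50, 120))
toy_total <- data.frame(years = c("2011", "2012"), Freq = c(1000, 1200))
toy_rate  <- normalize_by_year(toy_topic, toy_total, per = 100)
```

With real data, the call would be along the lines of `normalize_by_year(ngs_pubs_count, total_pubs_count)`, where `total_pubs_count` is the assumed table of all PubMed records per year.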

I was also curious about which ones would be the journals with the highest number of publications on this topic. Using the following code we can get the count of NGS publications per journal:

# Abbreviated journal title for each record
journal <- MedlineTA(ngs_records)
ngs_journal_count <- as.data.frame(table(journal))
# Sort by descending count and keep the 25 journals with the most publications
ngs_journal_count_top25 <- ngs_journal_count[order(-ngs_journal_count[,2]),][1:25,]

Again, the complete code, which normalizes the data by the total number of publications per journal and draws the following barplots, is available at our Gist:

[Figure: ngs_publications_total]

[Figure: ngs_publications_normalized]

You can find some other examples using this package at Dave Tang's Bioinformatics blog. Additionally, some alternatives to the RISmed package can be found at the R Chronicle and R Psychologist blogs.

Other potential applications of this package include creating a co-author network, as described in Matthew Maenner's blog.

Search and analyze carefully!

11 thoughts on “Analysis of PubMed search results using R”

    • Thank you, Martí. I'm glad you enjoyed it.
      A promise is a promise, so I really hope to be able to address that issue in my next post!

      • Thanks. I should read the tutorial, but do you know if it is possible to do the same with Web of Science databases?

  1. Hello Pilar,
    I tried the FreshBiostats GitHub Gist code, and in line 39:

    ngs_journal_count_top25 <- journal_count[order(-journal_count[,2]),][1:25,]

    I think you should replace "journal_count" with "ngs_journal_count".

    Thank you.

    Best regards,

    Martí Casals.

  2. Great post!
    I’m having an issue that seems to arise when there are zero publications in a given year for a given query:

    E.g.:

    search <- EUtilsSummary("tp53", type="esearch",db = "pubmed",mindate=1960,maxdate=1960, retmax=30000)

    Error in validObject(.Object) :
    invalid class “EUtilsSummary” object: invalid object for slot "id" in class "EUtilsSummary": got class "list", should be or extend class "character"

    Any idea how to solve this?

    Thanks a lot for your help!

    • Thank you so much for the tip, Josh! I didn’t have to face that problem at the time, but it will surely come in handy in the future.

      (Sorry I haven’t had the chance to check this till now…)

  3. For my homework, I need to search through an abstract summary on PubMed using R. One requirement is:
    “filters search results based on a term in the abstract”. Does the EUtilsSummary function search through publications or abstracts of the publications?

  4. retmax did not seem to solve the problem. I still have the following error:
    Error in validObject(.Object) :
    invalid class “EUtilsSummary” object: 1: invalid object for slot “PMID” in class “EUtilsSummary”: got class “list”, should be or extend class “character”
    invalid class “EUtilsSummary” object: 2: invalid object for slot “querytranslation” in class “EUtilsSummary”: got class “list”, should be or extend class “character”
    Is this a different problem?

    • Hi Abdhi,
      First of all, I'm very sorry for the late reply. The comment had gone unnoticed. In case you haven't solved the problem yet, could you provide your query string?
