Mining genomic databases with R

When dealing with genomic data, retrieving information from databases is unavoidable. Most repositories of genomic and proteomic data offer useful, friendly interfaces or browsers that are perfectly suitable for one-off searches. But if you are working with high-dimensional data, e.g. NGS data, more efficient tools are needed to mine those databases. R offers several packages that make this task straightforward.

NCBI2R is an R package to annotate lists of SNPs, genes and microsatellites. It obtains information from NCBI databases. This package provides quite useful functions for getting a quick glance at a given set of SNPs or genes, retrieving SNPs in a gene or getting genes belonging to a KEGG pathway (among some other features) as shown in the code below.

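library(NCBI2R)
# Look up the Entrez Gene ID for the gene symbol MAPT
GetIDs("MAPT[sym]")
# Annotation for Entrez Gene ID 4137 (MAPT)
GetGeneInfo(4137)
# SNPs located in that gene, then annotation for each of them
mysnps <- GetSNPsInGenes(4137)
GetSNPInfo(mysnps)
# Gene IDs belonging to a KEGG pathway
myset <- GetIDs("KEGG pathway:Alzheimer's disease")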

The biomaRt package, part of the Bioconductor project, is an R interface to the well-known BioMart data management system, which provides access to several data sources including Ensembl, HGNC, InterPro, Reactome and HapMap.

Most analyses can be performed with just two functions: useMart() to define the dataset, and getBM() to perform the query. getBM() runs online queries based on attributes (the values we are interested in retrieving), filters (restrictions on the query) and a given set of values for the filter. Once we define the dataset we are interested in, we can check all the filters and attributes available. Let's say we want to look for those genes associated with an OMIM phenotype.

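# Install from Bioconductor (BiocManager replaces the retired biocLite.R script)
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("biomaRt")
library(biomaRt)
# List the available marts, then select the human gene dataset in Ensembl
listMarts()
mymart <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
# Inspect the filters and attributes offered by this dataset
listFilters(mymart)
listAttributes(mymart)
# OMIM phenotype descriptions for the genes in myset (Entrez Gene IDs from the NCBI2R example).
# Recent Ensembl releases name the Entrez field "entrezgene_id" (older ones used "entrezgene"); check listAttributes(mymart).
getBM(attributes = c("entrezgene_id", "hgnc_symbol", "mim_morbid_description"), filters = "entrezgene_id", values = myset, mart = mymart)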

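org.Hs.eg.db, which is also part of Bioconductor, is an organism-specific package that provides annotation for the human genome. It is based on mappings that use Entrez Gene identifiers. In a very similar way to biomaRt, it lets you query the databases by means of the select() function, specifying columns (the kinds of data that can be returned), keytypes (which of the columns can be used as keys) and a given key.

# Install from Bioconductor (BiocManager replaces the retired biocLite.R script)
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("org.Hs.eg.db")
library(org.Hs.eg.db)
# Key types and columns offered by the installed release (they vary between releases)
keytypes(org.Hs.eg.db)
columns(org.Hs.eg.db)   # columns() replaces the older cols()
# Drop any column below that columns() does not list in your release
select(org.Hs.eg.db, keys = "MAPT", columns = c("SYMBOL", "CHR", "CHRLOC", "UNIPROT"), keytype = "SYMBOL")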

In many cases these three packages offer a good alternative to other well-known data mining systems such as the UCSC Table Browser data retrieval tool. Besides, their commands are quite simple, so even those who are not that familiar with the R language can benefit from them.

Just a quick tip to finish: I have just found out about the fread() function (data.table package) for reading large files. It really makes a difference!
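To make the tip concrete, here is a minimal sketch that times read.csv() against fread() on the same file. The SNP table is made up for illustration (the column names snp_id, chr and position are assumptions, not a real annotation file), and the data.table package is assumed to be installed. Timings will differ between machines, so none are quoted here.

```r
library(data.table)  # assumed to be installed

# Write a hypothetical SNP table to a temporary file so the example is self-contained
n <- 100000
snps <- data.frame(
  snp_id   = paste0("rs", seq_len(n)),
  chr      = sample(1:22, n, replace = TRUE),
  position = sample.int(1e8, n, replace = TRUE)
)
tmp <- tempfile(fileext = ".csv")
write.csv(snps, tmp, row.names = FALSE)

# Read the same file with the base R reader and with fread(), timing both
time_base  <- system.time(df <- read.csv(tmp))
time_fread <- system.time(dt <- fread(tmp))

# Elapsed seconds for each reader
print(c(read.csv = time_base[["elapsed"]], fread = time_fread[["elapsed"]]))

# fread() returns a data.table, which is also a data.frame
print(class(dt))
print(dim(dt))

unlink(tmp)
```

On a real multi-gigabyte file the same two calls apply; only the path changes.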

From Descriptive to Repeated Measures Data….one small step for studies, one giant leap for (bio)statistics

Traditional descriptive epidemiological studies, also called cross-sectional studies, are characterized by reporting on population health and describing the distribution of the collected exposure factors (variables), without relating them to other hypotheses. In other words, they try to answer three basic “W” questions: who, where and when. The most important uses of this kind of research include health planning and hypothesis generation. Nonetheless, the main pitfall is that researchers might be tempted to draw causal inferences from this type of study, even though the temporal relationship between the exposures and the outcomes of interest may be unclear. Thus, when a researcher wants to verify a causal effect between two variables, a more appropriate design is highly recommended, such as a study with two or more observations per subject collected over the established research period. The latter design corresponds to a repeated measures data structure and, more specifically, to longitudinal data analysis (a common form of repeated measures analysis in which measurements are recorded on individual subjects over time).

As mentioned in the previous paragraph, the main difference between the two study designs, cross-sectional and longitudinal, is that each experimental unit in the former is observed only once, so there is only one value per subject for each exposure factor. In other words, each row in the dataset is a single observation on a different subject. In longitudinal data, however, each subject is observed more than once, so the dataset holds several observations per subject.
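The difference in data layout can be sketched with a few made-up numbers. Everything below is hypothetical (three subjects, a blood pressure variable called sbp, visits at 0, 6 and 12 months) and uses only base R; it is meant as an illustration of the two layouts, not as a recommended data management workflow.

```r
# Cross-sectional layout: one row per subject, one value per variable
cross <- data.frame(
  id  = 1:3,
  sex = c("F", "M", "F"),
  sbp = c(120, 135, 128)          # hypothetical systolic blood pressure
)

# Longitudinal data in "wide" layout: repeated measurements stored as extra columns
wide <- data.frame(
  id     = 1:3,
  sex    = c("F", "M", "F"),
  sbp.0  = c(120, 135, 128),      # baseline
  sbp.6  = c(118, 138, 125),      # month 6
  sbp.12 = c(117, 140, 126)       # month 12
)

# "Long" layout: one row per subject and measurement occasion,
# which is the layout most modelling functions for repeated measures expect
long <- reshape(wide, direction = "long",
                varying = c("sbp.0", "sbp.6", "sbp.12"),
                v.names = "sbp", timevar = "month", times = c(0, 6, 12),
                idvar = "id")
long <- long[order(long$id, long$month), ]
rownames(long) <- NULL

print(cross)
print(long)
```

In the long layout each subject contributes three rows, so the rows are no longer independent of each other, which is exactly what the analysis has to account for.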

It is also worth pointing out the increase in complexity of the statistical approaches when moving from descriptive analysis to repeated measures studies. In the first setting, the statistical methods in use are the simplest ones: comparisons of means and percentages by means of classical tests, regression analysis, etc. In repeated measures data sets, however, and specifically in longitudinal data analysis, special statistical techniques are required for valid analysis and inference. Researchers should be aware of three important points when fitting a proper statistical model, in this order: (1) the trend of the temporal component; (2) the variance-covariance structure; (3) the mean structure. More precisely, the overall trend over time should be assessed first of all. Temporal trends can follow a linear, quadratic, cubic or even fourth-degree polynomial function. Besides, as observations on the same subject are likely to be correlated, repeated measures analysis must account for this correlation (the within- and between-subject effects must be controlled). Among the possible covariance structures, compound symmetry, unstructured and first-order autoregressive are the most commonly used. As for the mean structure, the potential exposure factors that could be related to the dependent variable should be included in the model.
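As a rough sketch of these three steps in R, the code below uses gls() from the nlme package (distributed with R) and its bundled Orthodont data. Both the choice of function and the dataset are assumptions made for illustration; the same ideas apply to mixed models or to other software. Age is centred at 11 and the Sex by age mean structure in step (2) is only provisional.

```r
library(nlme)  # recommended package distributed with R

# Orthodont (bundled with nlme): a distance measured on 27 children at ages 8, 10, 12 and 14

# (1) Temporal trend: linear versus quadratic in (centred) age.
#     Fitted by maximum likelihood because the fixed effects differ between the two fits.
trend_linear <- gls(distance ~ I(age - 11),
                    data = Orthodont,
                    correlation = corCompSymm(form = ~ 1 | Subject),
                    method = "ML")
trend_quadratic <- gls(distance ~ I(age - 11) + I((age - 11)^2),
                       data = Orthodont,
                       correlation = corCompSymm(form = ~ 1 | Subject),
                       method = "ML")
trend_aic <- AIC(trend_linear, trend_quadratic)

# (2) Variance-covariance structure: same provisional mean structure, three candidates
#     (REML, the default, is used because the fixed effects are identical across fits).
fit_cs <- gls(distance ~ Sex * I(age - 11), data = Orthodont,
              correlation = corCompSymm(form = ~ 1 | Subject))   # compound symmetry
fit_ar1 <- gls(distance ~ Sex * I(age - 11), data = Orthodont,
               correlation = corAR1(form = ~ 1 | Subject))       # first-order autoregressive
fit_un <- gls(distance ~ Sex * I(age - 11), data = Orthodont,
              correlation = corSymm(form = ~ 1 | Subject),       # free correlations ...
              weights = varIdent(form = ~ 1 | age))              # ... and one variance per age
cov_aic <- AIC(fit_cs, fit_ar1, fit_un)

# (3) Mean structure: fixed-effect estimates under one of the candidate structures
mean_table <- summary(fit_cs)$tTable

print(trend_aic)
print(cov_aic)
print(round(mean_table, 3))
```

Lower AIC values point to the better-supported trend and covariance structure, although in practice the choice should also be guided by the study design and by plots of the data.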

[Figure: longitudinal data plot]

Longitudinal studies play a key role, mostly in epidemiology and clinical research. They are used to determine the change in an outcome measurement or to evaluate the effectiveness of a new treatment in a clinical trial, among other settings. In these scenarios, due to the complexity of the statistical analyses, longitudinal studies involve a great deal of effort, but they offer several benefits. The most important, from my point of view, are the following: (1) the ability to measure change in outcomes and/or exposure at the individual level, so that the researcher has the opportunity to observe individual patterns of change; (2) the temporal order of the exposure factors and the outcomes is recorded, so the timing of outcome onset can be related to the covariates.

Finally, no special statistical package is needed for this kind of analysis. Nowadays, most packages include in their recent releases the procedures needed to perform at least a basic longitudinal analysis. So there is no longer any excuse for failing to tell a repeated/longitudinal analysis from a descriptive one, or for not carrying it out with confidence… Do you agree??

Tricky bits: Ordinal data

From studies of questionnaire responses to plant-flowering stages and the scoring of drug effects, researchers and biostatisticians often face situations in which some of the variables are not continuous but instead take a fixed set of ordered categories.

The straightforward approach would be to analyse these data in a way that fully respects their ordinal nature. In practice, however, many professionals still take a continuous (interval scale) approach so that the most familiar statistical techniques can be employed.

Carifio and Perla claim that this debate on the use and “abuse” of so-called Likert scales has been going on for over 50 years, and cite as one of the main advantages of what they call the “intervalist position” the easy access to both traditional techniques and the more complex ones built on them.

Other authors, such as Jamieson and Kuzon Jr. et al., advocate tailored procedures for this kind of variable. Their argument is based on the need to reflect the ordered nature of the data and on the difficulty, under a continuous approach, of measuring the distances between the categories of a variable.
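The two positions can be put side by side in a few lines of R. This is only a sketch under stated assumptions: it uses polr() and the housing data from the MASS package (chosen here simply because MASS is distributed with R), and the 1, 2, 3 scoring of the satisfaction categories in the "intervalist" fit is an arbitrary choice made for illustration.

```r
library(MASS)  # distributed with R; provides polr() and the housing data

# housing: counts (Freq) of residents by satisfaction (Sat: Low < Medium < High),
# perceived influence (Infl), type of rental (Type) and contact with neighbours (Cont)
dat <- housing

# "Intervalist" approach: score the ordered categories 1, 2, 3 and fit a linear model
dat$sat_num <- as.integer(dat$Sat)
fit_lm <- lm(sat_num ~ Infl + Type + Cont, weights = Freq, data = dat)

# Ordinal approach: proportional-odds (cumulative logit) model on the ordered factor
fit_polr <- polr(Sat ~ Infl + Type + Cont, weights = Freq, data = dat, Hess = TRUE)

# Linear model: differences in mean score, on the arbitrary 1-3 scale
print(round(coef(fit_lm), 3))

# Ordinal model: log odds ratios of being in a higher satisfaction category ...
print(round(coef(fit_polr), 3))
# ... plus estimated thresholds between adjacent categories instead of assumed equal spacing
print(round(fit_polr$zeta, 3))
```

The linear fit treats the gap between Low and Medium as equal to the gap between Medium and High, whereas the ordinal fit estimates the thresholds from the data, which is the core of the argument against the continuous approach.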

[Figure: divided bar chart]

Concerns regarding plotting this data have often been raised too. From pie to divided bar charts (see figure above and other examples here and here), it seems difficult to decide which form of visualisation is more appropriate and easily understandable.
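For readers who want to try one of these displays, here is a minimal base R sketch of a divided (stacked) bar chart. The counts, the item names and the five response labels are all invented for illustration.

```r
# Hypothetical counts of Likert-type responses to two questionnaire items
counts <- matrix(c( 5, 10, 20, 40, 25,
                   15, 25, 30, 20, 10),
                 nrow = 5,
                 dimnames = list(
                   response = c("Strongly disagree", "Disagree", "Neutral",
                                "Agree", "Strongly agree"),
                   item     = c("Item A", "Item B")))

# Convert counts to percentages within each item (each column sums to 100)
pct <- 100 * prop.table(counts, margin = 2)

# Divided bar chart: one horizontal bar per item, split by ordered response category
barplot(pct, horiz = TRUE, col = gray.colors(5),
        xlab = "Percentage of responses",
        legend.text = rownames(pct),
        args.legend = list(x = "topright", cex = 0.7))
```

Keeping the categories in their natural order within each bar, and using a sequential colour scale, lets the ordering of the data remain visible in the plot.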

Thankfully, the arrival of new graphical applications and user-friendly specialised software, such as the R packages ordinal and MCMCglmm, is bringing consensus closer and closer, so we may all soon be speaking the same “ordinal language”.

For or against? I look forward to reading your views on this!!

Note: A highly recommended book on the topic is Agresti, A. (2010). Analysis of Ordinal Categorical Data (2nd ed), Wiley.

2nd Biostatnet General Meeting Review

With a marked focus on young researchers in particular and health-related Biostatistics in general, the 2nd Biostatnet General Meeting, held in Santiago de Compostela (Spain) on the 25th and 26th of January, was a fantastic opportunity for the Network's members to get together and discuss common topics of concern as well as success stories.

FreshBiostats bloggers participated actively and now want to make our readers witnesses of this stimulating event.

7 of Biostatnet's 8 main researchers

After the welcome and opening session chaired by Carmen Cadarso, which focused on presentations of the Network's past activities by Emilio Letón, David Conesa, Inmaculada Arostegui and Jordi Ocaña, a busy program of events was fitted into a conference-like day and a half:

Young researchers oral communications

Because of the high level of participation in the meeting, the oral communications by Biostatnet's young researchers were divided into three sessions:

  • BIO session

The topics discussed in this first parallel session were the choice of primary end-points using a web application interface, by Moisés Gómez-Mateu; the modelling of a non-proportional hazards regression, by Mar Rodríguez-Girondo; and the randomization tests implemented in clinical trials, by Arkaitz Galbete. The second part of the session continued with two talks on Chronic Kidney Disease from two different approaches: the first one, from a survival analysis (competing risks analysis) point of view, was presented by Laetitia Teixeira, and the second one, based on longitudinal analysis (Bayesian longitudinal models), was presented by Hèctor Perpiñán. Finally, Mónica López-Ratón presented her work on the estimation of generalized symmetry points for classification in continuous diagnostic tests. This session was moderated by Carles Serrat.

  • STAT session

A varied set of talks was framed within the STAT session, which featured the interesting view of Joan Valls on the experience of the biostatisticians working at the IRBLleida, two applications of Structured Additive Regression (STAR) models by Elisa Duarte and Valeria Mamouridis, a comparative analysis of different models for the prediction of breast cancer risk by Arantzazu Arrospide, an optimal experimental design application presented by Elvira Delgado, and a simulation study on the performance of the Beta-Binomial SGoF multitesting method under dependence.

  • NET session

In this third parallel session, topics such as “bio” research as well as others related to design of experiments were covered. Irantzu Barrio started with a talk on development and implementation of a methodology to select optimal cut-points to categorise continuous covariates in prediction models. Also in this session, Mercedes Rodríguez-Hernández presented her work on D-optimal designs for Adair models.

Also covering “bio” topics, a talk on derivative contrasts in quantile regression was given by Isabel Martínez-Silva. María Álvarez focused afterwards on the application of the method of maximum combination when comparing proportions. The last two communications dealt with a cost-effectiveness study of treatments for fracture prevention in postmenopausal women, by Nuria Pérez-Álvarez, and the application of Generalised Additive Mixed Models to the assessment of temporal variability in mussel recruitment, by María P. Pata.

 Congratulations to the happy winners!!

To conclude these three sessions, Moisés Gómez-Mateu and Irantzu Barrio, the winners of the two ERCIM'12 Biostatnet invited sessions, received their awards (see picture above).

Poster sessions

Two poster sessions were also included in the hectic program of the meeting, covering a wide range of topics, from the analysis of clinical and genetic factors (Aurora Baluja) to a collective blogging experience like ours (find it here).

As a courtesy to the young researchers participating in the meeting, Biostatnet's main researchers gave each of us The Cartoon Guide to Statistics, which definitely finds the fun side of Statistics (see the snapshot below for a nice example ;P). We are very grateful for this gift and promise to make good use of it, maybe by trying to convince those who are still skeptical about the enjoyable side of this amazing science!

Image extracted from “The Cartoon Guide to Statistics”

Roundtables

Throughout the meeting, a total of 5 roundtable and colloquium sessions took place. Both professionals in the field of Biostatistics and young researchers participated and offered their views on different topics.

“Biostatisticians in biomedical institutions: a necessity?” was the first of these sessions, led by Arantza Urkaregi, Llorenç Bardiella, Vicente Lustres, and Erik Cobo. They set out to answer the question from their professional experience. The answer was unanimously positive.

The colloquium “Genomics, Biostatistics and Bioinformatics” was chaired by Malu Calle and featured presentations from Pilar Cacheiro (one of our bloggers), Roger Milne, Javier de las Rivas, and Álex Sánchez. They emphasized the importance of bringing together biostatistics and bioinformatics in the “omics” era, and a vibrant discussion followed regarding the definition of both terms.

The “Young Researchers roundtable” also generated a refreshing discussion about opportunities for young researchers in Biostatistics. Again, two of our bloggers, Altea Lorenzo and Hèctor Perpiñán, were involved in the session along with Núria Pérez and Oliver Valero, with Moisés Gómez as moderator and Isabel Martínez as organiser. The main conclusions reached at this roundtable were the need for young biostatisticians to claim their important role in institutions, the aspiration to access specialised courses in the field, and the importance of communication, collaboration, and networking.

On the second morning, another very important topic, “Current training in Biostatistics”, was presented by three professors, Carmen Armero, Guadalupe Gómez and José Antonio Roldán, who currently teach on Biostatistics master's and degree programmes offered by Spanish universities. Some interesting collective projects were outlined and will hopefully be implemented soon.

Plenary talk

We cannot forget the invited talk on “Past and Current Issues in Clinical Trials” by Urania Dafni, director of the Frontier Science Foundation-Hellas and Biostatistics professor and director of the Laboratory of Biostatistics of the University of Athens' Nursing School of Health Sciences. This renowned professional gave an overview of this hot topic and of the importance of involving biostatisticians in the whole process of drug design and development. This session and the following discussion were moderated by Montserrat Rué.

Closing colloquium

Finally, a session on the future of Biostatnet and the different alternatives for development and improvement was chaired by Jesús López Fidalgo and María Durbán, with the collaboration of Martín Cacheiro and Eva Fabeiro, experts on funding for international and national research projects, and Urania Dafni, director of the Greek node of the international network Frontier Science.

From what was shown, it seems like the year ahead is going to be a very busy and productive one for the Network and its members. All that we have left to say is… We are already looking forward to the 3rd General Meeting!!

LONG LIVE BIOSTATNET!!!

Your comments on the Meeting and this review are very welcome, let's keep the spirit up!