Who can be a biostatistician?

Nowadays, Statistics, and more specifically Biostatistics, is becoming an essential tool in scientific and technical research for people who work in very diverse contexts linked to human health, ecology, the environment, agriculture, etc.

With new advances in technology, extracting and storing information to create statistical databases is becoming an easier and more feasible task. That is why medical researchers, biologists, chemists, and other professionals without a background in Mathematics or Statistics may need to learn a range of statistical techniques to process their data. However, it would be a mistake to assume that mathematicians and statisticians need no additional training to fully understand the work that biostatisticians carry out. Mathematical knowledge (usually mainly theoretical) is not enough. Ideally, some training in the biosciences would be required too.

But how do you become a biostatistician? What kind of studies do you need? Nowadays there are many training courses and Master’s degrees aimed at biomedical researchers. Here, we will talk about some of them.

  • Basic and specific courses:

On the websites of Biostatnet and the Spanish Region of the International Biometric Society (Sociedad Española de Biometría, SEB), among many others, we can find a number of basic and specific courses targeted at health researchers. These courses may be aimed at professionals who are not statisticians or may have more complex content. The Servei d’Estadística Aplicada of the UAB (Universitat Autònoma de Barcelona) is an example of an interdepartmental service offering many courses and seminars at different levels.
There are also public healthcare institutions, among them EVES (Valencian School for Health Studies), which sometimes give courses specially aimed at doctors and nurses. It should be stressed that many scientists work daily with simple tests, such as Student’s t-test, and they need to understand them.

Furthermore, if you need more advanced training, there are also Master’s and postgraduate degrees that offer a high level of specialization.

  • Master’s and postgraduate degrees:

Currently, few universities in Spain offer Master’s degrees purely in Biostatistics. Most programmes combine it with related fields such as Bioinformatics (one of the newest tools in Genomics). Others combine Statistics and Operations Research, with Biostatistics as part of the syllabus. At present, the Universitat de València is the only one that offers a Master’s degree in Biostatistics.
The following are some of the Master’s degrees and postgraduate courses in Statistics taught at Spanish universities:

  1. Master’s degrees

Máster en Bioestadística (UV)

Máster en Bioinformática/Bioinformatics (UAB)

Máster en Bioinformática y bioestadística (UOC)

Máster en Estadística Aplicada (UGR)

Máster en Estadística e Investigación Operativa (UPC)

Máster en Técnicas Estadísticas (interuniversity Master’s degree offered jointly by UDC, USC and UVIGO)

  2. Postgraduate courses

Máster en Metodología de la investigación: Diseño y estadística en ciencias de la salud (UAB)

Máster en Bioestadística: fundamentos de la estadística (UOC)

As we can see, Biostatistics has yet to gain much visibility in Spanish universities. It would therefore be good to see them pay more attention to a science that is winning more and more followers. Do you want to join? Biostatistics is breaking down walls!

Graphics: an important issue to communicate

When we think about Biostatistics we usually have in mind some more or less complex modelling examples such as linear models, generalized linear models, etc. However, part of our job is to report our results to collaborators who are not biostatisticians, and we need to be able to explain and discuss them. A great tool for this, sometimes “forgotten”, is graphics.

In recent decades, the R Project for Statistical Computing (known as R) has grown remarkably, thanks to the contributions (packages) that researchers around the world share with the rest of the scientific community. One of the highlights of R is its versatility and customizability for producing graphics.

If you are reading these lines, the probability that you know how to plot with R using base graphics functions like plot(), barplot(), hist(), etc., is very high. My intention in this post is to present two packages that can radically change the look of our graphics, making them more professional and better looking. The packages are called lattice and ggplot2; both focus mainly on multivariate data but are flexible and also support univariate data.

Here is a self-explanatory description of lattice by its author: “Lattice is a powerful and elegant high-level data visualization system, with an emphasis on multivariate data, that is sufficient for typical graphics needs, and is also flexible enough to handle most nonstandard requirements”.

Lattice uses simple code with a syntax very similar to that of base graphics, and it supports 3D plots. There is a very interesting book on it, Lattice: Multivariate Data Visualization with R, written by Deepayan Sarkar.

With regard to ggplot2, the author describes it as: “a plotting system for R, based on the grammar of graphics, which tries to take the good parts of base and lattice graphics and none of the bad parts. It takes care of many of the fiddly details that make plotting a hassle (like drawing legends) as well as providing a powerful model of graphics that makes it easy to produce complex multi-layered graphics”.

ggplot2 uses a special syntax to construct graphics, reflecting an interesting way of thinking about plots that is based on the book The Grammar of Graphics by Leland Wilkinson. Plots are built layer by layer. By default the plots are very elegant, sober and professional, but the package also allows a high degree of customization (once the syntax is known). It does not support 3D plots, though. The package has a website that documents and explains it, and the book ggplot2: Elegant Graphics for Data Analysis by Hadley Wickham is also worth mentioning.

To learn more about lattice and ggplot2, I would recommend reading the exhaustive comparison published as a series of posts on Learning R. Now it is your turn to try all this out. You will love it!

Some issues in clinical trials design

A clinical trial is an experiment performed on human beings to measure the efficacy of a new treatment. The treatment could be a new drug, a new therapy, a surgical procedure, or any new clinical procedure that needs to be approved. Clinical trials therefore play a very important role in drug development and pharmaceutical research, because any new drug or procedure has to pass a thorough examination, often tightly regulated by the national drug regulatory agency of each country. Like any experiment, a clinical trial has a strong statistical component throughout: in the design, in the recruitment and follow-up of patients, and in the analysis of the results.

Conventionally, drug trials are classified into four phases, with each phase having a different purpose:

  •  Phase 1: Determine the potential toxicity.
  •  Phase 2: Preliminary study of efficacy and toxicity.
  •  Phase 3: Final test comparing the drug with a commonly used treatment or a placebo.
  •  Phase 4: Post approval follow-up of patient status.

Usually, each phase of the drug approval process is treated as a separate clinical trial and requires a different statistical analysis.

Phases 1 and 2 cover small to moderate-sized experiments (20-50 patients) and focus on determining toxicity and efficacy, so the final aim is to estimate the toxicity-efficacy curve. Usually, different doses are tested in the patients, and the measured responses give an estimate of the optimal dose, i.e. the one that ensures maximum efficacy without producing toxicity. There are many designs for these phases, based on optimality criteria or the use of Markov chains, among other approaches.
Phase 3 is the longest phase of the trial; it can involve thousands of patients, and it is also the most complex. As we stated before, the new treatment is compared against commonly used treatments or a placebo, so we have to assign the different treatments to the patients who enter the trial. There is a wide catalogue of phase 3 designs in the literature; an exhaustive review is given in Rosenberger and Lachin (2002). If the drug successfully passes through Phases 1, 2, and 3, it is approved by the regulatory agency. Finally, Phase 4 involves gathering additional information, including monitoring the treatment’s risks, late-developing side effects, benefits, and optimal use.

In the process of designing a clinical trial we have to deal with different issues. For example, in phase 3, the principal objective is to provide an unbiased comparison of the treatments. We have to avoid the different biases that can appear in the study. These biases can come from patients, physicians or unknown covariates, among other factors. A powerful tool for avoiding this problem is the random assignment of patients. Trials of this kind are called randomized clinical trials, and they use different probability rules to assign treatments to patients. However, randomization alone does not avoid all biases. For example, wherever possible, clinical trials should be double-masked, i.e., neither the patient nor the physician should know which treatment has been allocated to the patient.
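To make the idea of "probability rules" a little more concrete, here is a minimal Python sketch of two common ones: simple randomization (a fair draw for each patient) and permuted-block randomization (each block contains every arm equally often, so the arms stay balanced throughout recruitment). These two rules, the function names, the arm labels and the block size are illustrative assumptions of mine rather than something taken from the designs cited above, and a real trial would rely on validated software and a pre-specified protocol.

```python
import random
from collections import Counter


def simple_randomization(n_patients, arms=("A", "B"), seed=None):
    """Assign each patient to an arm independently, with equal probability."""
    rng = random.Random(seed)
    return [rng.choice(arms) for _ in range(n_patients)]


def permuted_block_randomization(n_patients, arms=("A", "B"), block_size=4, seed=None):
    """Assign patients in shuffled blocks that contain each arm equally often."""
    if block_size % len(arms) != 0:
        raise ValueError("block_size must be a multiple of the number of arms")
    rng = random.Random(seed)
    per_arm = block_size // len(arms)
    allocation = []
    while len(allocation) < n_patients:
        block = list(arms) * per_arm   # e.g. ["A", "B", "A", "B"]
        rng.shuffle(block)             # random order within the block
        allocation.extend(block)
    # The last block may be only partially used.
    return allocation[:n_patients]


# Hypothetical example: 20 patients, two arms (illustrative values only).
simple = simple_randomization(20, seed=2024)
blocked = permuted_block_randomization(20, block_size=4, seed=2024)
print("simple :", Counter(simple))
print("blocked:", Counter(blocked))
```

With simple randomization the two arms can end up with noticeably different sizes in a small trial, whereas the blocked rule forces balance after every complete block. The price is that the last assignments in a block become more predictable, which is one more reason why masking matters.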

Finally, although the importance of statistical tools for carrying out any experiment is well known, in clinical trials their complicated structure and strict regulation make these tools essential for conducting rigorous and efficient studies.

Approaching Statistical Genomics

I am sure you have heard about the ENCODE project. It has been all over the news in the last month. Along with other milestones like the Human Genome Project, HapMap or 1000 Genomes, it is a good example of the level of understanding of the human genome we are achieving.

Next Generation Sequencing (NGS) allows DNA sequencing at an unprecedented speed. Right now genomic projects mainly involve exome sequencing (the exome being the protein-coding regions of the genome), but the technology is rapidly evolving, and soon enough it will be cost-efficient to sequence whole genomes. Undoubtedly these projects will account for a good part of genomics research funding.

So much for a quick overview of what is happening in genomics right now and what is about to come in the near future. But what does all this mean from a statistical point of view? To put it plainly: a huge amount of data will need to be properly analyzed and interpreted.

Between 20,000 and 50,000 variants are expected per exome. Examining an individual’s exome in the search for disease-causing mutations requires advanced expertise in human molecular genetics. The challenge only grows when it comes to comparing multiple sequence variants among members of families (e.g. linkage analysis for monogenic disorders) or populations (e.g. case-control studies for complex disorders). High-dimensional data are nowadays the rule, and sooner or later anyone working in genomics will face problems that require knowledge of bioinformatics and of specific statistical methods to be solved.

Since one of my fields of interest is the identification of susceptibility genes for complex disorders, I thrive on the new challenges that NGS presents, in particular the possibility of performing rare-variant analysis. Ron Do et al. have just published a complete review on this subject.

I am just focusing here on what is usually referred to as tertiary analysis in an NGS pipeline, i.e. analyzing the variants previously identified and extracting biological meaning from them. However, we should not forget the opportunities in the development of base calling, sequence alignment or assembly algorithms.

Furthermore, DNA/exome sequencing is just one slice of the cake. Other statistical issues arise in the analysis of other high-throughput “omics” data such as those coming from RNA-seq, ChIP-seq or Methylation-seq studies.

The message of this post: to date, the capacity for generating genomic data is far beyond the ability to interpret that data. Whether you are interested in developing new statistical methods or considering a more applied career, there is no doubt that statistical genomics is a hot field right now!

As an extra incentive for those coming from a mathematical background, you will get to work closely with geneticists, molecular biologists, clinicians and bioinformaticians among others. Interdisciplinarity being one of our blog mottos, statistical genomics wins by far…