Dose Finding Experiments

In a past entry, I discussed some issues in clinical trial design, explaining their structure and different phases. Now I am focusing on one part of these trials, the first two phases, which can also be a complete experiment by themselves: dose-finding experiments.

The aim of a dose-finding experiment is safe and effective drug administration in humans. When a new drug (or procedure) is under study, we want to determine a dose that is safe to administer but also effective. A balance between these two goals, non-toxicity and efficacy, is required in clinical trials.

Ethical concerns are essential in these experiments, as in any experiment conducted on human beings, but especially so because they are first-in-man studies, so the safety of the participants is the main concern. They also have a very small sample size, usually about 20 patients, which poses a problem for the statistical analysis.

In phase 1 trials, the target is the evaluation of the maximum tolerated dose (MTD), the highest dose level whose observed toxicity rate does not exceed a pre-established value. Depending on the risk of the experiment, a toxicity rate is fixed and the maximum dose level which does not exceed this toxicity rate is chosen. The recommended dose for the next phases of the study is then either the MTD or one dose level below the MTD. In phase 2 trials, we have an analogous experiment, but now the target is the minimal effective dose (MED), which is the minimum dose level that reaches a fixed efficacy rate. It is also common to try to combine these two targets in a single experiment, estimating a toxicity-efficacy curve and looking for an optimal dose that balances these two goals.
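As a rough illustration of these two targets, the sketch below picks the MTD and the MED from a table of observed rates. The dose levels, the rates and the two thresholds are made-up numbers chosen only for the example, and the rates are assumed to increase with dose.

```r
# Hypothetical observed rates at five increasing dose levels (illustrative only)
dose     <- c(10, 20, 40, 80, 160)
tox_rate <- c(0.00, 0.05, 0.15, 0.30, 0.55)  # proportion of patients with toxicity
eff_rate <- c(0.10, 0.25, 0.50, 0.70, 0.75)  # proportion of patients responding

target_tox <- 0.33  # pre-established toxicity rate (assumed value)
target_eff <- 0.40  # fixed rate of response required (assumed value)

# MTD: highest dose whose toxicity rate does not exceed the target
mtd <- max(dose[tox_rate <= target_tox])

# MED: lowest dose whose response rate reaches the target
med <- min(dose[eff_rate >= target_eff])

cat("MTD:", mtd, " MED:", med, "\n")
```

Any dose between the MED and the MTD would satisfy both criteria in this toy example; real designs estimate these quantities with far more care about the uncertainty in such small samples.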

A wide catalogue of designs for dose-finding experiments can be found in the statistical literature. The initial dose, the dose escalation, the stopping point and the accuracy of the estimation of the MTD and MED are the main concerns in a design, and they are still a fertile area for research. A classical design in phase 1 is the traditional 3+3 design, widely used in oncology experiments. In this design, patients are assigned in groups of three and the trial starts at the lowest dose level. The first three patients are treated and, if none of them shows any toxicity, we assign the next patients to the next level; if there is one case of toxicity, we repeat the experiment at the same level; and if there are two or more toxicities at the same level, we conclude that we have exceeded the MTD. This procedure is repeated until we exceed the MTD.
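The 3+3 rule described above can be sketched as a small simulation. This is only a simplified reading of the design: the function name simulate_3plus3, the true toxicity probabilities, the choice to escalate when at most one of six patients shows toxicity, and reporting the level below the exceeded one as the MTD are all assumptions made for the example.

```r
# Simulate one trial under a simplified 3+3 rule.
# true_tox: assumed true toxicity probability at each dose level (hypothetical).
# Returns the estimated MTD as a dose level index; 0 means that even the
# lowest level was judged too toxic.
simulate_3plus3 <- function(true_tox, cohort_size = 3) {
  level <- 1
  repeated <- FALSE   # TRUE once the current level has been repeated
  tox_at_level <- 0   # toxicities accumulated at the current level
  repeat {
    tox_at_level <- tox_at_level + rbinom(1, cohort_size, true_tox[level])
    if (tox_at_level >= 2) {
      # Two or more toxicities at this level: the MTD has been exceeded
      return(level - 1)
    }
    if (tox_at_level == 1 && !repeated) {
      # Exactly one toxicity: treat another cohort at the same level
      repeated <- TRUE
      next
    }
    # No toxicity, or at most one after repeating the level: escalate
    if (level == length(true_tox)) {
      return(level)   # the highest available level was never exceeded
    }
    level <- level + 1
    repeated <- FALSE
    tox_at_level <- 0
  }
}

set.seed(1)
true_tox <- c(0.05, 0.10, 0.25, 0.45, 0.60)
mtd <- replicate(1000, simulate_3plus3(true_tox))

# How often each level is selected as the MTD across simulated trials
print(table(factor(mtd, levels = 0:length(true_tox))))
```

Repeating the trial many times like this gives a feel for how variable the selected MTD can be with such small cohorts.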

Phases 1 and 2 have received much less theoretical attention than phase 3 of clinical trials, but with this post I wanted to highlight their importance and make them more understandable to people who work in statistics.

Handling multiple data frames in R

I am not (yet) that highly skilled at programming in R, so when I run into a function or package that really meets my needs, well, that is quite a big deal. And that is what happened with the plyr package.

I often have to work with a large number of data frames. It is not about “big” statistics here, just some basic descriptives, subsetting… It has more to do with handling this amount of data in an easy and efficient way.

There are two routines I find essential when manipulating these data sets. The first one is being able to operate on all these data frames at once (e.g. subsetting for filtering) by creating lists.

So let's say we have a certain number of data frames, file_1, file_2, …, each of them with the same variables named var1, var2, …, and we want to subset all of them based on a certain variable.

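library(plyr)
# Read every tab-separated file whose name matches "file_" into a list of data frames
dataframes <- list.files(pattern = "file_")
list_dataframes <- llply(dataframes, read.table, header = TRUE, sep = "\t")
# Number of rows and columns of each data frame, one row per file
dimensions <- ldply(list_dataframes, dim)
# Keep only the rows where var1 equals "myvalue"
filter <- llply(list_dataframes, subset, var1 == "myvalue")
# Keep only the columns var1 and var3
selection <- llply(list_dataframes, subset, select = c(var1, var3))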

No need for “for” loops here! It is certainly much neater and easier this way. More information about llply, ldply or laply can be found in the plyr R tutorial. Much has been said about its advantages in other blogs; you can check it here or in my “indispensable” gettinggeneticsdone.
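The snippet above depends on plyr and on files sitting in the working directory. For trying the idea out without either, here is a rough base R sketch of the same steps using lapply and two small made-up data frames; the data values are invented and the anonymous functions stand in for the subset calls.

```r
# Two small hypothetical data frames with the same variables
list_dataframes <- list(
  file_1 = data.frame(var1 = c("myvalue", "other"),
                      var2 = 1:2,
                      var3 = c(0.1, 0.2)),
  file_2 = data.frame(var1 = c("myvalue", "myvalue", "other"),
                      var2 = 3:5,
                      var3 = c(0.3, 0.4, 0.5))
)

# Rows and columns of each data frame, one row per data frame
dimensions <- t(sapply(list_dataframes, dim))

# Keep only the rows where var1 equals "myvalue"
filtered <- lapply(list_dataframes, function(df) df[df$var1 == "myvalue", ])

# Keep only the columns var1 and var3
selection <- lapply(list_dataframes, function(df) df[, c("var1", "var3")])

print(dimensions)
```

The result of each step is again a list with one element per data frame, so the steps can be chained in the same way as with llply.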

The second one allows us to identify common values between those data frames. The first function that comes to mind is merge. Again, there are several useful posts about it (gettinggeneticsdone, r-statistics), and it certainly serves the purpose in many cases. But quite frequently you find yourself with more than two data frames to merge; merge_all and merge_recurse in the reshape package overcome this problem. There is an excellent R wiki entry covering this topic.
As an alternative to merge, join (again in the plyr package) lets you specify how duplicates should be matched.
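To make the idea of merging several data frames concrete, here is a minimal base R sketch that applies merge repeatedly with Reduce. It is not the reshape implementation, just the same idea under simple assumptions: three made-up data frames that share a key column, here called id.

```r
# Three small hypothetical data frames sharing the key column "id"
file_1 <- data.frame(id = c(1, 2, 3), var1 = c("a", "b", "c"))
file_2 <- data.frame(id = c(2, 3, 4), var2 = c(10, 20, 30))
file_3 <- data.frame(id = c(3, 4, 5), var3 = c(TRUE, FALSE, TRUE))
list_dataframes <- list(file_1, file_2, file_3)

# Merge pairwise from left to right; by default merge() keeps only
# the ids present in both inputs, so only ids common to all remain
common <- Reduce(function(x, y) merge(x, y, by = "id"), list_dataframes)

# all = TRUE keeps every id and fills the missing values with NA
everything <- Reduce(function(x, y) merge(x, y, by = "id", all = TRUE),
                     list_dataframes)

print(common)
print(everything)
```

The first result answers the "common values" question directly, while the second shows where each data frame has gaps.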

Note that both packages, plyr and reshape, are developed by Hadley Wickham, ggplot2's creator.

These functions have become part of my daily routine and they definitely save me a lot of trouble. I have yet to explore another package I have read great things about: sqldf.

Do you have any other suggestions on manipulating data frames?

Big data analysis: a new ongoing challenge

Science, like other areas of knowledge, has had its own evolution. Before 1600, it was referred to as empirical science. The second period, from approximately 1600 to 1950, is the so-called theoretical period, in which each discipline developed theoretical models that often motivated experiments and broadened our understanding. Then came the computational period: over the following 40 years, these disciplines grew a computational branch based on simulations to find solutions to complex mathematical models. Since 1990, with the spread of computers around the world, a new period has started: as technology advances, the size and number of experimental data sets are increasing exponentially, mainly thanks to the ability to economically store and manage petabytes (more than terabytes!) of data and to its easy accessibility. For instance, in modern medicine there is now a well-established tradition of depositing scientific data in public repositories, and also of creating public databases for use by other scientists: researchers collect huge amounts of information about patients through imaging technology (CAT scans), DNA microarrays, etc. This is what we call big data.

But… what about the process of analyzing this kind of data? How do we handle this amount of data? As a biostatistician, at first glance it does not seem like a complex problem to manage. Essentially, the main procedure is based on the extraction of interesting (non-trivial, previously unknown and potentially useful) patterns from the big data set. This is what is called data mining or machine learning. This technique was initially applied in business (e.g., to identify the profile of customers of a certain brand), but nowadays the impact of data abundance extends well beyond this discipline.

There are several reasons that support the use of huge data sets, but mainly: (1) they allow us to relax assumptions of linearity or normality of the variables collected in the databases; (2) we can identify rare events or low-incidence populations; (3) data analysts can generate better predictions and a better understanding of effects.

As for the functionalities of data mining, there are numerous kinds of patterns to be mined, but I will focus on those which are most relevant and most often applied in the biomedical sciences: classification, cluster analysis and outlier analysis. In the first one, data analysts construct models to describe and distinguish classes or concepts for future prediction. The typical methods within classification are the well-known decision trees or logistic regression models. In contrast, cluster analysis groups data to form new categories and find distribution patterns, maximizing intra-class similarity and minimizing inter-class similarity at the same time. In future posts I will give a more extensive explanation of certain methods belonging to these functionalities…
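To give a flavour of these three functionalities, here is a very small base R sketch. It uses the built-in mtcars data set purely as a stand-in (it is neither big nor biomedical), and the variables, the 0.5 cut-off, the number of clusters and the boxplot rule for outliers are arbitrary choices made for the example.

```r
# Classification: logistic regression predicting a binary class (am)
# from one predictor (wt), then assigning classes with a 0.5 cut-off
fit <- glm(am ~ wt, data = mtcars, family = binomial)
prob <- predict(fit, type = "response")
predicted_class <- as.integer(prob > 0.5)
accuracy <- mean(predicted_class == mtcars$am)  # in-sample agreement only

# Cluster analysis: k-means on standardized variables, with k = 2
set.seed(123)
features <- scale(mtcars[, c("mpg", "wt", "hp")])
clusters <- kmeans(features, centers = 2, nstart = 10)

# Outlier analysis: values lying beyond the boxplot whiskers
outliers <- boxplot.stats(mtcars$hp)$out

print(accuracy)
print(table(clusters$cluster))
print(outliers)
```

Note that the accuracy here is computed on the same data used to fit the model, so it is optimistic; with real data one would evaluate on held-out observations.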

Finally, it must be pointed out that these kinds of data sets can be analyzed with the most widely used statistical software (e.g., SPSS, SAS or R), and although some operations may be CPU intensive, it is definitely worth the effort to find a solution to this ongoing challenge: big data analysis.

Are you prepared to face it?

Invitation to the Biostatnet 2nd General Meeting

Hi everyone!

As active members of the National Biostatistics Network Biostatnet, we would like to invite all the young (and not so young) members to participate in the Biostatnet 2nd General Meeting, which will take place on the 25th and 26th of January 2013 in Santiago de Compostela. You can find the program of the event here and can request any further information at


Apart from some really interesting talks and roundtables, this meeting has a main focus on young researchers, and as such we believe it is particularly important that we participate and try to get involved in this amazing network that is bringing us closer and closer despite the physical separation.

Remember!! The deadline for submission of posters and oral communications is the 14th of December, so HURRY UP!!

FreshBiostats will be represented in the roundtable that will be held on the 25th so it would be fantastic to get your comments as to which topics you would like to see discussed or any issues that affect you directly. Thank you for your collaboration!!