Simulation: a good option for calculating and estimating probabilities

Much of the work we do is based on statistical models used to reproduce real-life processes, yet modelling is not the only option. When we know the rules and restrictions that govern a system, we can reproduce its behaviour using simulation techniques and, with a few simple statistical calculations, learn everything we are interested in about the system.

Coinciding with the start of the Spanish League 2013-2014 two weekends ago, a group of friends and I were wondering about a series of questions concerning the future results of the League:

  1. Will we have a League of two (Barcelona and Madrid)?

  2. What are the chances of my team winning the League (I am a supporter of Valencia)?

  3. Assuming that Madrid or Barcelona will win the League, what chances do the other teams have of finishing second?

To answer these questions, I was challenged to simulate the current League using information from the last 5 seasons:

  • I downloaded the general statistics of the last five seasons from the www.as.com website. With this information, I calculated the probabilities of a home win (HW), home draw (HD), home loss (HL), away win (AW), away draw (AD) and away loss ​​(AL) for each of the 20 current teams in the Spanish League. (Note: Elche's probabilities are the average of the probabilities of the 10 teams that have been in the First Division in the last five leagues but are not in that division in the 2013-2014 league.)

  • From the League schedule, I calculated the probability of a win, draw or loss for each league match, given the home and away teams.

  • I simulated 10,000 leagues, from which I calculated Pr(Win), the probability of winning the current league; Pr(Champ), the probability of finishing between first and fourth (Champions League); Pr(Eur), the probability of entering European competitions (Champions or Europa League); and Pr(ncc), the probability of not changing category (avoiding relegation).
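As an illustration, the procedure above can be sketched in a few lines of code. I did the real calculations in R, but here is the same idea in Python; the team names and the per-match probabilities are invented for the example (in the post they were estimated from the last five seasons' statistics):

```python
import random
from collections import defaultdict

random.seed(42)

teams = ["A", "B", "C", "D"]  # hypothetical mini-league

def match_probs(home, away):
    """Hypothetical Pr(home win, draw, away win) for a given pairing.
    In the real exercise these come from the historical HW/HD/HL and
    AW/AD/AL frequencies of the two teams."""
    return (0.50, 0.25, 0.25)

def simulate_league():
    """Play every match of one season and return the final ranking."""
    points = defaultdict(int)
    for home in teams:
        for away in teams:
            if home == away:
                continue
            p_hw, p_d, _ = match_probs(home, away)
            r = random.random()
            if r < p_hw:
                points[home] += 3          # home win
            elif r < p_hw + p_d:
                points[home] += 1          # draw
                points[away] += 1
            else:
                points[away] += 3          # away win
    # rank by points; ties broken at random here (the real league
    # uses head-to-head results)
    return sorted(teams, key=lambda t: (points[t], random.random()),
                  reverse=True)

n_sim = 10_000
wins = defaultdict(int)
for _ in range(n_sim):
    table = simulate_league()
    wins[table[0]] += 1  # count who finishes first

pr_win = {t: wins[t] / n_sim for t in teams}
print(pr_win)  # estimated Pr(Win) for each team
```

Pr(Champ), Pr(Eur) and Pr(ncc) follow the same pattern: instead of counting only the first position, count how often each team lands in the relevant range of positions.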


Pr(Barcelona 1st and Madrid 2nd or Madrid 1st and Barcelona 2nd) = Pr(Barcelona 1st and Madrid 2nd) + Pr(Madrid 1st and Barcelona 2nd) = 0.2930 + 0.2428 = 0.5358


Figure 1: Boxplot of the points per team in the 10,000 simulated leagues.

Besides the obvious conclusions we can draw from these results, we can see that we are clearly in a league of two. This sort of procedure also allows us to emulate complex systems in which we know the rules needed to calculate the corresponding probabilities. For example, in Biostatistics we could work out the probability of an offspring being affected by a genetic disease if we know the probabilities of the parents being carriers (a genetics problem).
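To make the genetics example concrete, here is a minimal sketch under simple assumptions: an autosomal recessive disease, hypothetical carrier probabilities for each parent, and the standard 1/4 chance that a child of two carriers is affected. The simulation estimate can be checked against the exact product of probabilities:

```python
import random

random.seed(1)

# Hypothetical carrier probabilities (invented for the example).
p_carrier_mother = 0.1
p_carrier_father = 0.1

def child_affected():
    """One simulated offspring: affected only if both parents are
    carriers AND the child inherits the recessive allele from each
    (probability 1/4 when both parents are carriers)."""
    mother_carrier = random.random() < p_carrier_mother
    father_carrier = random.random() < p_carrier_father
    return mother_carrier and father_carrier and random.random() < 0.25

n = 100_000
estimate = sum(child_affected() for _ in range(n)) / n
exact = p_carrier_mother * p_carrier_father * 0.25  # = 0.0025
print(estimate, exact)
```

For such a simple system the exact calculation is trivial, but the simulation approach scales to pedigrees and inheritance rules where the closed-form probability is much harder to write down.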

If you are interested in the topic of Simulation and Sports, I recommend reading the paper “On the use of simulation methods to compute probabilities: application to the Spanish first division soccer league” by Ignacio Díaz-Emparanza and Vicente Núñez-Antón, which explains in much more detail how to approach this problem from different points of view.

I implemented all the calculations for this post with the free software R.

Do you have experience in simulation? Tell us about it!!!


Keeping up to date with research in your field (Part II)

  • Online free courses/tutorials: there is plenty of material online, which sometimes makes it difficult to filter what is really worthwhile. Here again, tips from blogs or colleagues in your network might serve as a reference. Coursera is, in my opinion, one of the best platforms, due to the quality and versatility of its courses. There are several excellent courses related to Statistics and Data Analysis. Some of them are more general, about R programming (e.g. Data Analysis and Computing for Data Analysis, both using R), but there are also more specific ones (e.g. Design and Interpretation of Clinical Trials, Statistical Analysis of fMRI Data); you can check the full list here.

I would like to mention here some other resources available for those with a math/statistics background who might be interested in getting some insight into genetics. As we mentioned previously in other posts, it is critical to understand the data you are dealing with and these websites will help you with that:

An extensive list of additional Online Genetics Education Resources can be found at the NHGRI site

For those wanting an introduction to NGS, there is a Next Generation Sequencing Practical Course at EMBL-EBI Train online. A more advanced tutorial, showing the use of R/Bioconductor packages for High-Throughput Sequence Analysis, can be found here.

There are, obviously, countless courses and tutorials about R and specific packages. Besides, GitHub is becoming more and more popular. By creating a Gist on GitHub you can share your code quickly and easily; see a quick example here.

  • Webinars: many commercial sites offer highly focused free webinars that might be of interest. For instance, both Science and Nature host webcasts regularly.

  • Forums/discussion lists: when you are stuck with something and unable to find a solution, specialized forums might come to the rescue. Either because the same question has been asked before, or because there is someone willing to help, you will most likely get your doubt solved. Two forums are particularly useful in my field: BioStar and SEQanswers. As for R programming, R-help from the R Mailing List and Stack Overflow are two of the sites where you can find most of your doubts answered. Our life without them would be much more difficult, for sure…

As I mentioned at the beginning of the previous post, it is sometimes difficult to find a balance between the time you spend learning and your more “productive” time. Besides, for those of us whose work is also a passion, the line between work and personal interests often becomes blurred, and so we spend much of our leisure time diving into new material that will eventually be useful in our work. Some might argue that the time spent in training, or the amount of information you have access to, can be overwhelming. Is it worth the effort? How much time should we invest in learning? Are we able to take advantage of what we learn? You can take a look at this video for more elaborate thoughts on the subject.

I hope the information contained in these posts is useful… Your suggestions on additional resources will be more than welcome!


Keeping up to date with research in your field (Part I)

No doubt about it: we must keep up with news and advances in our area of expertise. In this series of two posts I want to introduce the ways I find useful for achieving this goal. Staying up to date means not only knowing what is being done in your field but also learning new skills, tools or tricks that might be applied. I will save for last some thoughts on finding a proper work-learning balance and its potential impact on productivity.

  • Blogs. It might be an obvious one, but it is for sure one of my main sources of information. The blogs I follow include Getting Genetics Done, Genomes Unzipped, Our 2 SNPs, Wellcome Trust, R-Bloggers, Simply Statistics and many others mainly focused on biostatistics that you can find in our blog roll. Most of them are accessible through RSS feeds, if not through mail subscription.
  • Twitter. Most blogs also have a twitter account where you can follow their updates (so it can serve as an alternative). You can also follow twitter accounts from networks of interest, companies, or people working in your field. For some ideas on whom to follow, go to our twitter!
  • PubMed / Journals alerting services. A keyword-specific PubMed search can be just as relevant. Available through both email and RSS feeds, you will get updates containing your search terms (for instance “Next Generation Sequencing”, “rare variant association analysis”, “Spastic Paraplegia”…). You can also get information about an author's work or the citations of a given paper; you can find out how to do it here. An alternative is to set up alerts for the Tables of Contents of your journals of interest, informing you of the topics of the latest papers (Nature Genetics, Bioinformatics, Genome Research, Human Mutation, Biostatistics…). Accessing RSS feeds through your mail app is straightforward (Mozilla Thunderbird in my case).
  • Professional networking sites. Obviously, when it is all about networking, having a good network of colleagues is one of the best ways to keep up with job offers, news or links to resources. For instance, through my LinkedIn contacts I receive quite a bunch of useful tips. Well-selected LinkedIn groups are also a source of very valuable information and news, as are companies in your area of work (pharma industry, genomic services, biostatistics/bioinformatics consulting). LinkedIn is a more general site, but there are other professional sites focused on research: ResearchGate and Mendeley. Mendeley in particular, apart from being a networking site, is an excellent reference manager. This, along with MyNCBI, are the two main tools I use to keep my bibliography and searches organized.
  • Distribution lists. Apart from general distribution lists from one's institution or funding agency, more specific newsletters or bulletins from networks such as Biostatnet, or from the scientific societies you belong to, are a good source of news, events and so on. There are even more restricted ones (for instance, in my institution an R users list has recently been created).

To be continued next week …..


A brief introduction to the SAS® System: what is it and how does it work?

In previous posts on this site, comparisons among the most used statistical packages available for scientists were conducted, pointing out their strengths and weaknesses.

Although nowadays there is a trend towards using open source programs for data science, some commercial statistical packages are also important, since they make our lives as scientists easier. One of them is the Statistical Analysis System® (SAS).

The SAS System was created in the 1970s and since then has been a leading product in data warehousing, business analysis and analytical intelligence. It is currently the best-selling all-in-one database software. In other words, SAS can be described as an Extraction-Transformation-Loading (ETL), reporting and forecasting tool, which makes it a good option for data warehousing. It can also be considered a statistical software package that allows the user to manipulate and analyze data in many different ways.

The main basic component of the SAS program is the Base SAS module, the part of the software designed for data access, transformation and reporting. It contains: (1) a data management facility (extraction, transformation and loading of data); (2) a programming language; (3) statistical analysis procedures; and (4) output delivery system utilities for reporting. All these functions are managed by means of DATA steps and procedure (PROC) calls. In the following sections some introductory, basic examples are described:

(a) Generating new data sets.

It is possible to generate new data sets by using the SAS Editor environment (place where the code is written and executed). Suppose we want to create a data set of 7 observations and 3 variables (two numerical and one categorical).

data one;
input gender $ age weight;
datalines;
F 13 43
M 16 62
F 19 140
F 20 120
M 15 60
M 18 95
F 22 75
;
run;

The SAS code shown above creates the desired data set, called “one”. As the gender variable (the first one) is categorical, its values are “M” for male and “F” for female. The dollar sign ($) is used when you have a text variable rather than a numerical variable (i.e., gender coded as M/F rather than as 1 denoting male and 2 denoting female).

(b) Statistical analysis

To perform a basic descriptive analysis (percentage and mean computations), the following procedures should be executed:

  • For frequencies and percentage computation:

proc freq data = one;
tables gender;
run;

  • Descriptive analysis for continuous variables:

proc means data = one n mean std;
var age weight;
run;

As can be observed, it is really easy to remember which statistical procedure to use according to the type of variable: proc freq for categorical data and proc means for continuous data.
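For comparison with the open source tools mentioned at the start of this post, the same frequencies and summary statistics can be sketched in a few lines of Python using only the standard library (the data are the hypothetical “one” data set from above):

```python
from collections import Counter
from statistics import mean, stdev

# The "one" data set from the SAS example above.
gender = ["F", "M", "F", "F", "M", "M", "F"]
age    = [13, 16, 19, 20, 15, 18, 22]
weight = [43, 62, 140, 120, 60, 95, 75]

# proc freq equivalent: counts and percentages of gender.
counts = Counter(gender)
percents = {g: 100 * c / len(gender) for g, c in counts.items()}
print(counts, percents)

# proc means equivalent: n, mean and standard deviation.
for name, values in [("age", age), ("weight", weight)]:
    print(name, len(values), mean(values), stdev(values))
```

This is only a sketch of the equivalent computations, not a substitute for the reporting machinery that the SAS procedures provide out of the box.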

(c) Output

Another important issue is the way the results are shown, and the SAS statistical package has improved in this area. Before the release of version 9.0 (I believe), one had to copy and paste all the results from the output window into a Word document in order to get a proper, saved version of the results. From version 9.0 on, things have changed: all results can be delivered in PDF, RTF or even HTML format. As a SAS user, I can say that this has been a good idea: not only do I no longer have to waste lots of time copying and pasting, but the less useful results can also be left unprinted. That has been a great leap!!! For example, if you want to have the results in PDF format, you should follow these instructions:

ods pdf;
proc means data = one n mean std;
var age weight;
run;
ods pdf close;

This code generates a PDF file showing the mean and standard deviation of the age and weight variables of the one data set.

Because of its capabilities, this software package is used in many disciplines, including the medical, biological and social sciences. Knowing the SAS programming language will likely help you both in your current classes or research and in getting a job. If you want to go further in SAS programming, The Little SAS Book by Lora Delwiche and Susan Slaughter is an excellent resource for getting started; I also used it when I started learning SAS. Now it is your turn!!

Have a nice holiday!