I am sure you have heard about the ENCODE project. It was all over the news last month. Along with other milestones like the Human Genome Project, HapMap or 1000 Genomes, it is a good example of the level of understanding of the human genome we are achieving.
Next Generation Sequencing (NGS) allows DNA sequencing at an unprecedented speed. Right now genomic projects mainly involve exome sequencing (the exome being the protein-coding regions of the genome), but the technology is evolving rapidly, and soon enough it will be cost-efficient to sequence whole genomes. Undoubtedly these projects will account for a good part of genomics research funding.
That was a quick overview of what is happening in genomics right now and what is about to come in the near future. But what does all this mean from a statistical point of view? To put it plainly: a huge amount of data will need to be properly analyzed and interpreted.
Between 20,000 and 50,000 variants are expected per exome. Examining an individual's exome in the search for disease-causing mutations requires advanced expertise in human molecular genetics. The task only gets harder when we compare multiple sequence variants among members of families (e.g. linkage analysis for monogenic disorders) or populations (e.g. case-control studies for complex disorders). High-dimensional data are nowadays the rule, and sooner or later anyone working in genomics will face problems that require knowledge of bioinformatics and of specific statistical methods to be solved.
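To give a feel for what that many variants means statistically, here is a minimal sketch of the multiple-testing arithmetic. It assumes one test per variant, a conventional significance level of 0.05, and that every variant is truly null; the function names are hypothetical, chosen only for this illustration, and real analyses would use methods tailored to the study design.

```python
# Illustrative sketch only: one test per variant, all variants assumed null.
ALPHA = 0.05  # conventional significance level (an assumption of this sketch)


def bonferroni_threshold(n_tests: int, alpha: float = ALPHA) -> float:
    """Per-test threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


def expected_false_positives(n_tests: int, alpha: float = ALPHA) -> float:
    """Expected number of null tests that pass an uncorrected threshold alpha."""
    return n_tests * alpha


# Range of variants per exome quoted in the post.
for n_variants in (20_000, 50_000):
    print(
        n_variants,
        bonferroni_threshold(n_variants),
        expected_false_positives(n_variants),
    )
```

Under these assumptions, testing every variant at the uncorrected 0.05 level would flag roughly 1,000 to 2,500 variants by chance alone, while a Bonferroni correction pushes the per-variant threshold down to the order of one in a million.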
Since one of my fields of interest is the identification of susceptibility genes for complex disorders, I thrive on the new challenges that NGS presents, in particular the possibility of performing rare variant analysis. Ron Do et al. have just published a comprehensive review on this subject.
I am focusing here only on what is usually referred to as tertiary analysis in an NGS pipeline, i.e. analyzing the previously identified variants and extracting biological meaning from them. However, we should not forget the opportunities in the development of base calling, sequence alignment or assembly algorithms.
Furthermore, DNA/exome sequencing is just one slice of the cake. Other statistical issues arise in the analysis of other high-throughput “omics” data such as those coming from RNA-seq, ChIP-seq or Methylation-seq studies.
The message of this post: to date, the capacity for generating genomic data far exceeds our ability to interpret it. Whether you are interested in developing new statistical methods or are considering a more applied career, there is no doubt that statistical genomics is a hot field right now!
As an extra incentive for those coming from a mathematical background, you will get to work closely with geneticists, molecular biologists, clinicians and bioinformaticians among others. Interdisciplinarity being one of our blog mottos, statistical genomics wins by far…