Science, like other areas of knowledge, has had its own evolution. Before 1600, it was essentially empirical science. The second period, from roughly 1600 to 1950, is called the theoretical period: each discipline developed theoretical models that motivated experiments and broadened our understanding. Then came the computational period, and over the following 40 years these disciplines grew a computational branch based on simulations to find solutions to complex mathematical models. Since 1990, with the spread of computing across the world, a new period has started: as technology advances, the size and number of experimental data sets are increasing exponentially, mainly thanks to the ability to economically store and manage petabytes (more than terabytes!) of data and to their easy accessibility. For instance, in modern medicine there is now a well-established tradition of depositing scientific data into public repositories and of creating public databases for use by other scientists: researchers collect huge amounts of information about patients through imaging technology (CAT scans), DNA microarrays, etc. This is what we call big data.
But… what about the process of analyzing this kind of data? How do we handle this amount of data? As a biostatistician, at first glance it does not seem like a complex problem to manage. Essentially, the main procedure is based on the extraction of interesting (non-trivial, previously unknown and potentially useful) patterns from the big data set. This is what is called data mining, or the machine learning process. The technique was initially applied in business (e.g., to identify the profile of customers of a certain brand), but nowadays the impact of data abundance extends well beyond that discipline.
There are several reasons that support the use of huge data sets, but mainly: (1) they allow us to relax assumptions of linearity or normality about the variables collected in the databases; (2) we can identify rare events or low-incidence populations; (3) data analysts can generate better predictions and a better understanding of effects.
As for the functionalities of data mining, there are numerous patterns to be mined, but I will focus on those most relevant and applicable to the biomedical sciences: classification, cluster analysis and outlier analysis. In the first, data analysts construct models to describe and distinguish classes or concepts for future prediction; typical methods within classification are the well-known decision trees and logistic regression models. In contrast, cluster analysis groups data to form new categories and find distribution patterns, maximizing intra-class similarity while minimizing inter-class similarity. In future posts I will give a more extensive explanation of certain methods belonging to these functionalities…
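To make the two functionalities above concrete, here is a minimal sketch in Python with scikit-learn (an assumption on my part: the analyses could equally be run in the packages mentioned later, such as R or SAS; the iris data set stands in for a real biomedical data set):

```python
# Sketch of classification and cluster analysis, the two functionalities
# described above. Assumes Python with scikit-learn installed; the iris
# data set is a toy stand-in for a real biomedical data set.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Classification: fit a logistic regression model on labeled training
# data, then evaluate how well it predicts the classes of unseen cases.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)

# Cluster analysis: with the labels hidden, group the observations into
# new categories so that similar cases fall in the same cluster.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
```

The key contrast is that classification learns from known class labels, while clustering discovers groupings with no labels at all.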
Finally, it must be pointed out that data sets of this kind can be analyzed with the most widely used statistical software (e.g., SPSS, SAS or R), and although some operations may be CPU-intensive, it is definitely worth the effort to work out a solution to this ongoing challenge: big data analysis.
Are you prepared to face it?