When faced with a statistical analysis task, we usually begin with a simple description of our data that allows us to analyze and interpret our variables, with the aim of reaching a decision on the hypotheses formulated at the start of the study.
This post deals with some statistical methods that are regarded as exploratory techniques, but that go further than the usual descriptive statistics. It focuses on correspondence analysis (CA) and principal components analysis (PCA), both of which are central to multivariate analysis.
PCA and CA are usually applied to high-dimensional datasets, with the main objective of reducing the dimensionality of the data. Although each method has its own particularities, both aim to explain latent variables in the problem through the observed data.
PCA: widely used to capture the essential patterns of large datasets. In high-dimensional data it is sometimes difficult for researchers to extract interesting features, so one way to address this is to reduce the dimensionality of the data at the cost of losing some information. PCA works by creating new uncorrelated variables (called principal components, or PCs) as linear combinations of the original variables, which are in general correlated.
Taken together, the principal components carry the same information as the original variables. The goal is to keep only a few of them while preserving as much of the variance in the data as possible.
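To make the idea concrete, here is a minimal sketch of PCA computed by hand. It is written in Python with NumPy rather than R, the toy data are invented for illustration, and it is not meant to reproduce what any particular R function does internally. It builds the components from an eigendecomposition of the covariance matrix and reports the share of variance each one carries.

```python
import numpy as np

# Hypothetical toy data: 200 observations of 3 variables,
# where the second variable is strongly correlated with the first.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.8 * x1 + 0.2 * rng.normal(size=200)
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])

# Center each variable, then eigendecompose the sample covariance matrix.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order; reorder from largest to smallest.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Scores: the new variables, each a linear combination of the centered originals.
scores = Xc @ eigvecs

# Share of the total variance carried by each component.
explained = eigvals / eigvals.sum()

# Dimension reduction: keep only the first two components.
reduced = scores[:, :2]

print("explained variance ratio:", explained)
```

Using all three components loses nothing, since the variance shares sum to one. Dropping the last one is where the reduction, and the loss of information, happens.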
CA: unlike PCA, this method is applied to categorical data (without computing linear combinations of the original variables) as a procedure for analyzing contingency tables. CA allows us to describe the relationship between two nominal variables, as well as the relationships among their levels, by plotting them on a set of Cartesian axes.
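For the simple case of one two-way table, the computation behind CA can be sketched as follows. This is again Python with NumPy, the contingency table is made up, and the code follows the standard textbook formulation (a singular value decomposition of the standardized residuals from independence) rather than the internals of any specific package.

```python
import numpy as np

# Hypothetical contingency table:
# rows are the levels of one nominal variable, columns the levels of another.
N = np.array([[20.0, 10.0, 5.0],
              [10.0, 25.0, 10.0],
              [5.0, 10.0, 30.0]])

n = N.sum()
P = N / n              # correspondence matrix (relative frequencies)
r = P.sum(axis=1)      # row masses
c = P.sum(axis=0)      # column masses

# Standardized residuals: departure of each cell from independence.
E = np.outer(r, c)
S = (P - E) / np.sqrt(E)

# The SVD of the residuals gives the axes of the CA map.
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

# Principal coordinates of the row and column levels on those axes.
row_coords = (U * sv) / np.sqrt(r)[:, None]
col_coords = (Vt.T * sv) / np.sqrt(c)[:, None]

# Inertia per axis; the total equals the chi-square statistic divided by n.
inertia = sv ** 2
total_inertia = inertia.sum()

print("share of inertia per axis:", inertia / total_inertia)
```

Plotting the first two columns of `row_coords` and `col_coords` on the same axes gives the usual CA map, in which levels that behave similarly appear close together.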
The extension of correspondence analysis to many categorical variables is called multiple correspondence analysis.
The applications of PCA and CA are wide and varied, in fields such as biology, ecology, social sciences, psychology and image processing, where the number of variables is large. As noted above, in that situation PCA and CA give us a way to extract latent variables and intrinsic characteristics of the population that have not been observed directly, so we can think of them as a hypothesis-generation tool.
In R, the ‘ca’ and ‘mjca’ functions of the ‘ca’ package perform simple and multiple correspondence analysis, respectively, and the ‘princomp’ function of the ‘stats’ package performs principal components analysis. The following figure illustrates a typical PCA plot, showing the first two components.
Here I have briefly touched on a few aspects of two procedures for describing large datasets. In a future post I will try to work through an example using R. Meanwhile, you can try out the ‘ca’ and ‘princomp’ functions.