The last post that I published was about two techniques of Multivariate Analysis: Principal Component Analysis (PCA) and Correspondence Analysis (CA). In this post I will show a practical example of PCA with R. Let’s go!
We are going to work with Fisher’s Iris Data available in package “datasets”. This data, collected over several years by Edgar Anderson was used to show that these measurements could be used to differentiate between species of irises. That data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.
You can load the Iris data and examine this data frame with:
data(iris) str(iris); summary(iris[1:4]) pairs(iris[1:4],main="Iris Data", pch=19, col=as.numeric(iris$Species)+1) mtext("Type of iris species: red-> setosa; green-> versicolor; blue-> virginica", 1, line=3.7,cex=.8)
As we said in the last post, PCA is used to create linear combinations of the original data that capture as much information in the original data as possible. For that and before starting with PCA is convenient to mention some particularities of this methodology.
In the prcomp function we need indicate if the principal components are calculated through correlation matrix (with standardized data) or covariance matrix (with raw data). We will standardize our variables when these have different units and have very different variances. If they are in the same units both alternatives are possible. In our example all variables are measured in centimetres but we will use the correlation matrix for simplicity’s sake.
#To examine variability of all numeric variables sapply(iris[1:4],var) range(sapply(iris[1:4],var)) # maybe this range of variability is big in this context. #Thus, we will use the correlation matrix #For this, we must standardize our variables with scale() function: iris.stand <- as.data.frame(scale(iris[,1:4])) sapply(iris.stand,sd) #now, standard deviations are 1
Now applied the prcomp() function to calculate the principal components:
#If we use prcomp() function, we indicate 'scale=TRUE' to use correlation matrix pca <- prcomp(iris.stand,scale=T) #it is just the same that: prcomp(iris[,1:4],scale=T) and prcomp(iris.stand) #similar with princomp(): princomp(iris.stand, cor=T) pca summary(pca) #This gives us the standard deviation of each component, and the proportion of variance explained by each component. #The standard deviation is stored in (see 'str(pca)'): pca$sdev
In order to decide how many principal components should be retained, it is common to summarise the results of a principal components analysis by making a scree plot, which we can do in R using the “screeplot()” function:
#plot of variance of each PCA. #It will be useful to decide how many principal components should be retained. screeplot(pca, type="lines",col=3)
From this plot and from the values of the ‘Cumulative Proportion of Variance’ (in summary(pca)) we can conclude that retaining 2 components would give us enough information, as we can see that the first two principal components account for over 95% of the variation in the original data.
#The loadings for the principal components are stored in: pca$rotation # with princomp(): pca$loadings
This means that the first two principal component is a linear combination of the variables:
where are the standardization of original variables.
The weights of the PC1 are similar except the associate to Sepal.Width variable that is negative. This component discriminate on one side the Sepal.Width and on the other side the rest of variables (see biplot). This one principal component accounts for over 72% of the variability in the data.
All weights on the second principal component are negative. Thus the PC2 might seem considered as an overall size measurement. When the iris has larger sepal and petal values than average, the PC2 will be smaller than average. This component explain the 23% of the variability.
The following figure show the first two components and the observations on the same diagram, which helps to interpret the factorial axes while looking at observations location.
#biplot of first two principal components biplot(pca,cex=0.8) abline(h = 0, v = 0, lty = 2, col = 8)
To interpret better the PCA results (qualitatively) would be useful to have the opinion of an expert in this area, as sometimes is somewhat confusing. I encourage you to participate!