Featured

# An example of Principal Components Analysis

The last post that I published was about two techniques of Multivariate Analysis: Principal Component Analysis (PCA) and Correspondence Analysis (CA). In this post I will show a practical example of PCA with R. Let’s go!

We are going to work with Fisher’s Iris Data available in package “datasets”. This data, collected over several years by Edgar Anderson was used to show that these measurements could be used to differentiate between species of irises. That data set  gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris.  The species are Iris setosa, versicolor, and virginica.

You can load the Iris data and examine this data frame with:

data(iris)
str(iris); summary(iris[1:4])
pairs(iris[1:4],main="Iris Data", pch=19, col=as.numeric(iris\$Species)+1)
mtext("Type of iris species: red-> setosa; green-> versicolor; blue-> virginica", 1, line=3.7,cex=.8)

We will use “prcomp” R function to carry out the analysis, which is similar to “princomp” function.

As we said in the last post, PCA is used to create linear combinations of the original data that capture as much information in the original data as possible. For that and before starting with PCA is convenient to mention some particularities of this methodology.

In the prcomp function we need indicate if the principal components are calculated through correlation matrix (with standardized data) or covariance matrix (with raw data). We will standardize our variables when these have different units and have very different variances. If they are in the same units both alternatives are possible. In our example all variables are measured in centimetres but we will use the correlation matrix for simplicity’s sake.

#To examine variability of all numeric variables
sapply(iris[1:4],var)
range(sapply(iris[1:4],var))
# maybe this range of variability is big in this context.
#Thus, we will use the correlation matrix
#For this, we must standardize our variables with scale() function:
iris.stand <- as.data.frame(scale(iris[,1:4]))
sapply(iris.stand,sd) #now, standard deviations are 1

Now applied the prcomp() function to calculate the principal components:

#If we use prcomp() function, we indicate 'scale=TRUE' to use correlation matrix
pca <- prcomp(iris.stand,scale=T)
#it is just the same that: prcomp(iris[,1:4],scale=T) and prcomp(iris.stand)
#similar with princomp(): princomp(iris.stand, cor=T)
pca
summary(pca)
#This gives us the standard deviation of each component, and the proportion of variance explained by each component.
#The standard deviation is stored in (see 'str(pca)'):
pca\$sdev

In order to decide how many principal components should be retained, it is common to summarise the results of a principal components analysis by making a scree plot, which we can do in R using the “screeplot()” function:

#plot of variance of each PCA.
#It will be useful to decide how many principal components should be retained.
screeplot(pca, type="lines",col=3)

From this plot and from the values of the ‘Cumulative Proportion of Variance’ (in summary(pca)) we can conclude that retaining 2 components would give us enough information, as we can see that the first two principal components account for over 95% of the variation in the original data.

#The loadings for the principal components are stored in:

This means that the first two principal component is a linear combination of the variables:

$PC1 = 0.521*Z_1 - 0.269*Z_2 + 0.580*Z_3 + 0.564*Z_4$

$PC2 = -0.377*Z_1 - 0.923*Z_2 - 0.024*Z_3 - 0.066*Z_4$

where $Z_1, \ldots, Z_4$ are the standardization of original variables.

The weights of the PC1 are similar except the associate to Sepal.Width variable that is negative. This component discriminate on one side the Sepal.Width and on the other side the rest of variables (see biplot).  This one principal component accounts for over 72% of the variability in the data.

All weights on the second principal component are negative. Thus the PC2 might seem considered as an overall size measurement. When the iris has larger sepal and petal values than average, the PC2 will be smaller than average. This component explain the 23% of the variability.

The following figure show the first two components and the observations on the same diagram, which helps to interpret the factorial axes while looking at observations location.

#biplot of first two principal components
biplot(pca,cex=0.8)
abline(h = 0, v = 0, lty = 2, col = 8)

To interpret better the PCA results (qualitatively) would be useful to have the opinion of an expert in this area, as sometimes is somewhat confusing. I encourage you to participate!

## 7 thoughts on “An example of Principal Components Analysis”

1. Nice post! I like the code snippets and explanation. As you also mention, I’ve found qualitative interpretations of PCA results a little unintuitive. To make it a little easier, I created this visualization of PCA results.

This was done in R as well. I’ll have to dig out the code and post it in a follow up, but I hope it prompts some ideas and is helpful!

• Thank you very much, Josh for your comment! I’ve seen your post and I think it is a clear and original way to represent PCA results. As you say the interpretation could easily become tedious!
I like your blog very much!

• Hi Josh, could you please send me the PCA link again, as the link in your comment seems to be broken

Cheers!
Avish

2. I have found this as a great example for how to produce a PCA. One thing that I was a little unsure of was how the precise % of variance explained by each principle component was determined. I also like very much the heat map idea for data exploration purposes.

3. excellent tutorial Silvia- you are a gifted teacher! ajit

4. From the example above, 1). on the biplot how do we label the factors or groups if we have such a column and how do we also apply different colors to the different groups/factor levels e.g. 00/01T1, 01/02T1, …00/01T2, 01/02T2, ….e.t.c 00/01T1 meaning season/year 2000/2001 at temperature class 1, 00/01T2 meaning season/year 2000/2001 at temperature class 2? Which r code do we use for that additional step?

2). which r code do we use to obtain factor scores?