FreshBiostats birthday and September-born famous statisticians

On the occasion of the first birthday of FreshBiostats, we want to remember some of the great statisticians who were born in September and have contributed to the “joy of (bio)stats”.

Gerolamo Cardano Pavia, 24 September 1501 – 21 September 1576 First systematic treatment of probability
Caspar Neumann Breslau, 14 September 1648 – 27 January 1715 First mortality rates table
Johann Peter Süssmilch Zehlendorf, 3 September 1707 – 22 March 1767 Demographic data and socio-economic analysis  
Georges Louis Leclerc (Buffon) Montbard, 7 September 1707 – Paris, 16 April 1788 Premier example in “geometric probability” and a body of experimental and theoretical work in demography
Adrien-Marie Legendre Paris, 18 September 1752 – 10 January 1833 Development of the least squares method
William Playfair Liff, 22 September 1759 – London, 11 February 1823 Considered the founder of graphical methods of statistics (line graph, bar chart, pie chart, and circle graph)
William Stanley Jevons Liverpool, 1 September 1835 – Hastings, 13 August 1882 Statistical atlas – graphical representations of time series
Anders Nicolai Kiaer Drammen, 15 September 1838 – Oslo, 16 April 1919 Representative sample
Charles Edward Spearman London, 10 September 1863 – 17 September 1945 Pioneer of factor analysis and Spearman’s rank correlation coefficient
Anderson Gray McKendrick Edinburgh, 8 September 1876 – 30 May 1943 Several discoveries in stochastic processes and collaborator in the path-breaking work on the deterministic model for the general epidemic
Maurice Fréchet Maligny, 2 September 1878 – Paris, 4 June 1973 Contributions in econometrics and spatial statistics
Paul Lévy 15 September 1886 – 15 December 1971 Several contributions to probability theory
Frank Wilcoxon County Cork, 2 September 1892 – Tallahassee, 18 November 1965 Wilcoxon rank-sum tests, Wilcoxon signed-rank test
Mikhailo Pylypovych Kravchuk Chovnytsia, 27 September 1892 – Magadan, 9 March 1942 Krawtchouk polynomials, a system of polynomials orthogonal with respect to the binomial distribution
Harald Cramér Stockholm, 25 September 1893 – 5 October 1985 Important statistical contributions to the distribution of primes and twin primes
Hilda Geiringer Vienna, 28 September 1893 – California, 22 March 1973 One of the pioneers of disciplines such as molecular genetics, genomics, bioinformatics,…
Harold Hotelling Fulda, 29 September 1895 – Chapel Hill, 26 December 1973 Hotelling’s T-squared distribution and canonical correlation
David van Dantzig Rotterdam, 23 September 1900 – Amsterdam, 22 July 1959 Focus on probability, emphasizing the applicability to hypothesis testing
Maurice Kendall Kettering, 6 September 1907 – London, 29 March 1983 Random number generation and Kendall’s tau
Pao-Lu Hsu Peking, 1 September 1910 – Peking, 18 December 1970 Founder of the newly formed discipline of statistics and probability in China

It is certainly difficult to think of the field without their contributions. They are all a great inspiration to keep on learning and working!!

Note: you can find other interesting dates here.

Update: and Significance’s timeline of statistics here.

Any author you consider particularly relevant? Any other suggestions?


Infographics in Biostatistics

Although the history of infographics according to Wikipedia does not seem to mention Fritz Kahn as one of the pioneers of this technique, I would like to start this post by mentioning one of the didactic graphical representations of this Jewish German doctor, who was highly reputed as a popular science writer in his time.

Apart from his fascinating views of the human body in the form of machines and industrial processes, I am particularly attracted by his illustration below, summarising the evolution of life on Earth as a clock in which the history of humans would not take more than a few seconds…

Animal clock

Image extracted from the printed version of the article “Fritz Kahn, un genio olvidado” published in El País, on Sunday 1st of September 2013.

What could be understood by some as a naive simplification of matters requires, in my opinion, a great deal of scientific knowledge, and it is a fantastic effort to communicate the science behind very complex mechanisms.

This and more modern infographic forms of visualisation represent an opportunity for statisticians, and more specifically biostatisticians, to make our field approachable and understandable for the wider public. Areas such as Public Health (see here), Cancer research (find examples here and here), and Drug development (see here) are already using them, so we should not be ashamed to make these “less rigorous” graphical representations an important tool in our work.

Note: There are plenty of resources available online to design a nice infographic in R. For a quick peek into how to create easy pictograms, check out this entry in Robert Grant’s stats blog. Also, the wordcloud R package will help you visualise the main ideas from texts…

We will soon show a practical example of these representations in this blog, stay tuned!


An example of Principal Components Analysis

The last post that I published was about two techniques of Multivariate Analysis: Principal Component Analysis (PCA) and Correspondence Analysis (CA). In this post I will show a practical example of PCA with R. Let’s go!

We are going to work with Fisher’s Iris data, available in the “datasets” package. These data, collected over several years by Edgar Anderson, were used to show that these measurements could differentiate between species of irises. The data set gives the measurements in centimeters of sepal length and width and petal length and width for 50 flowers from each of 3 species of iris: Iris setosa, versicolor, and virginica.

You can load the Iris data and examine this data frame with:

str(iris); summary(iris[1:4])
pairs(iris[1:4],main="Iris Data", pch=19, col=as.numeric(iris$Species)+1)
mtext("Type of iris species: red-> setosa; green-> versicolor; blue-> virginica", 1, line=3.7,cex=.8)


We will use the “prcomp” R function to carry out the analysis; it is similar to the “princomp” function.

As we said in the last post, PCA is used to create linear combinations of the original variables that capture as much of the information in the original data as possible. Before starting with PCA, it is convenient to mention some particularities of this methodology.

In the prcomp function we need to indicate whether the principal components are calculated through the correlation matrix (with standardized data) or the covariance matrix (with raw data). We will standardize our variables when they have different units and very different variances. If they are in the same units, both alternatives are possible. In our example all variables are measured in centimetres, but we will use the correlation matrix for simplicity’s sake.

# To examine the variability of all numeric variables:
sapply(iris[1:4], var)
range(sapply(iris[1:4], var))
# This range of variability may be large in this context,
# so we will use the correlation matrix.
# For this, we standardize our variables with the scale() function:
iris.stand <- as.data.frame(scale(iris[, 1:4]))
sapply(iris.stand, sd) # now all standard deviations are 1
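
As a quick check of why standardizing corresponds to using the correlation matrix, the following sketch compares the covariance matrix of the standardized variables with the correlation matrix of the raw ones. The names iris_std, cov_std and cor_raw are only illustrative and do not appear elsewhere in this post.

```r
# Standardize the four measurements (repeated here so the snippet runs on its own)
iris_std <- scale(iris[, 1:4])

# Covariance matrix of the standardized data ...
cov_std <- cov(iris_std)
# ... and correlation matrix of the raw data
cor_raw <- cor(iris[, 1:4])

# Both matrices should agree up to numerical tolerance
same_matrix <- isTRUE(all.equal(cov_std, cor_raw, check.attributes = FALSE))
print(same_matrix)
```

If the two matrices agree, a PCA on the covariance matrix of the standardized data is the same as a PCA on the correlation matrix of the original data.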

Now we apply the prcomp() function to calculate the principal components:

# With prcomp(), setting 'scale. = TRUE' uses the correlation matrix
pca <- prcomp(iris.stand, scale. = TRUE)
# This is equivalent to prcomp(iris[, 1:4], scale. = TRUE) and to prcomp(iris.stand)
# Similar with princomp(): princomp(iris.stand, cor = TRUE)
# summary() gives the standard deviation of each component and the proportion of variance explained by each component
summary(pca)
# The standard deviations are stored in pca$sdev (see 'str(pca)'):
pca$sdev
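
To make the link with the correlation matrix more concrete, here is a minimal sketch, under the assumption that we refit the model in a separate object (pca_check and eig are illustrative names). It checks that the squared standard deviations from prcomp() match the eigenvalues of the correlation matrix, and that the loadings match its eigenvectors up to sign.

```r
# Refit the PCA on the raw measurements, letting prcomp() standardize them
pca_check <- prcomp(iris[, 1:4], scale. = TRUE)

# Eigen decomposition of the correlation matrix
eig <- eigen(cor(iris[, 1:4]))

# Variances of the components vs. eigenvalues
same_values <- isTRUE(all.equal(pca_check$sdev^2, eig$values))

# Loadings vs. eigenvectors (signs are arbitrary, so compare absolute values)
same_vectors <- isTRUE(all.equal(abs(unname(pca_check$rotation)),
                                 abs(eig$vectors)))

print(c(values = same_values, vectors = same_vectors))
```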

In order to decide how many principal components should be retained, it is common to summarise the results of a principal components analysis by making a scree plot, which we can do in R using the “screeplot()” function:

#plot of the variance of each principal component.
#It will be useful to decide how many principal components should be retained.
screeplot(pca, type="lines",col=3)

From this plot and from the values of the ‘Cumulative Proportion’ of variance (in summary(pca)) we can conclude that retaining 2 components would give us enough information, as we can see that the first two principal components account for over 95% of the variation in the original data.
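
The cumulative proportion can also be computed by hand from the standard deviations of the components. The snippet below is only a sketch: the object names and the 95% cut-off are assumptions chosen for illustration, not a general rule.

```r
# Refit the PCA so that the snippet runs on its own
pca_check <- prcomp(iris[, 1:4], scale. = TRUE)

# Proportion of variance explained by each component
var_explained <- pca_check$sdev^2 / sum(pca_check$sdev^2)

# Cumulative proportion of variance
cum_var <- cumsum(var_explained)
print(round(cbind(Proportion = var_explained, Cumulative = cum_var), 4))

# Smallest number of components reaching the (assumed) 95% threshold
n_keep <- which(cum_var >= 0.95)[1]
print(n_keep)
```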

#The loadings for the principal components are stored in:
pca$rotation # with princomp(): pca$loadings

This means that the first two principal components are linear combinations of the variables:

PC1 = 0.521*Z_1 - 0.269*Z_2 + 0.580*Z_3 + 0.564*Z_4

PC2 = -0.377*Z_1 - 0.923*Z_2 - 0.024*Z_3 - 0.066*Z_4

where Z_1, ..., Z_4 are the standardized versions of the original variables (Sepal.Length, Sepal.Width, Petal.Length and Petal.Width, in that order).

The weights of PC1 are similar, except the one associated with the Sepal.Width variable, which is negative. This component separates Sepal.Width on one side from the rest of the variables on the other (see biplot). This first principal component accounts for over 72% of the variability in the data.

All weights on the second principal component are negative. Thus PC2 might be considered an overall size measurement: when an iris has larger sepal and petal values than average, PC2 will be smaller than average. This component explains about 23% of the variability.
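
To see these linear combinations at work, the following sketch multiplies the standardized variables by the loadings and compares the result with the scores that prcomp() stores in the x element. Z, pca_check and scores_manual are illustrative names, and the signs of the loadings may differ between R versions.

```r
# Standardized variables
Z <- scale(iris[, 1:4])

# PCA on the standardized data
pca_check <- prcomp(Z)

# Scores computed by hand: each column is a linear combination of the Z's
scores_manual <- Z %*% pca_check$rotation

# They should match the scores returned by prcomp()
same_scores <- isTRUE(all.equal(scores_manual, pca_check$x,
                                check.attributes = FALSE))
print(same_scores)

# First two components for the first few flowers
print(head(scores_manual[, 1:2]))
```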

The following figure shows the first two components and the observations on the same diagram, which helps to interpret the factorial axes while looking at the location of the observations.

# biplot of the first two principal components
biplot(pca)
abline(h = 0, v = 0, lty = 2, col = 8)


To better interpret the PCA results (qualitatively), it would be useful to have the opinion of an expert in this area, as the interpretation is sometimes somewhat confusing. I encourage you to participate!