It’s fairly common to have a lot of dimensions (columns, variables) in your data. You wish you could plot all the dimensions at the same time and look for patterns. Perhaps you want to group your observations (rows) into categories somehow. Unfortunately, we quickly run out of spatial dimensions in which to build a plot, and you probably don’t want to do clustering (partitioning) by hand.

This example will use the `iris`

data set available in R, which has four numeric variables. This is not very many, and the data is pretty nicely behaved, so the results of Principal Component Analysis and clustering will not be terribly bad. You won’t always get decent results when you try to arbitrarily reduce the dimensionality of your data to three just so you can make pretty graphs. But it sure is fun – and it can be useful, both for exploring data and communicating results.

Let’s set up and look at our data:

data(iris) # this is a little tweak so that things line up nicely later on iris$Species <- factor(iris$Species, levels = c("versicolor","virginica","setosa")) head(iris) # Sepal.Length Sepal.Width Petal.Length Petal.Width Species # 1 5.1 3.5 1.4 0.2 setosa # 2 4.9 3.0 1.4 0.2 setosa # 3 4.7 3.2 1.3 0.2 setosa # 4 4.6 3.1 1.5 0.2 setosa # 5 5.0 3.6 1.4 0.2 setosa # 6 5.4 3.9 1.7 0.4 setosa

We might take a look at how correlated the four variables are.

round(cor(iris[,1:4]), 2) # Sepal.Length Sepal.Width Petal.Length Petal.Width # Sepal.Length 1.00 -0.12 0.87 0.82 # Sepal.Width -0.12 1.00 -0.43 -0.37 # Petal.Length 0.87 -0.43 1.00 0.96 # Petal.Width 0.82 -0.37 0.96 1.00

Sepal length, petal length, and petal width all seem to move together pretty well (Pearson’s *r* > 0.8) so we could possibly start to think that we can reduce dimensionality without losing too much.

We’ll use `princomp`

to do the PCA here. There are many alternative implementations for this technique. Here I choose to use the correlation matrix rather than the covariance matrix, and to generate scores for all the input data observations. (My only reference is SAS help, but basically it seems that using the correlation matrix sort of helps you worry less (vs. using the covariance matrix) about normalizing all your variables perfectly.)

pc <- princomp(iris[,1:4], cor=TRUE, scores=TRUE)

The results are stored in `pc`

and we can examine them in a variety of ways.

summary(pc) # Importance of components: # Comp.1 Comp.2 Comp.3 Comp.4 # Standard deviation 1.7083611 0.9560494 0.38308860 0.143926497 # Proportion of Variance 0.7296245 0.2285076 0.03668922 0.005178709 # Cumulative Proportion 0.7296245 0.9581321 0.99482129 1.000000000

Here we see that the first three components bring our cumulative proportion of variance to 0.99 already, which is nothing to sneeze at. You can get a similar sort of idea from a scree plot.

plot(pc,type="lines")

Heck, in this case you might even think that just two factors is enough. We can certainly plot in two dimensions. Here is a biplot.

biplot(pc)

This is pretty interesting. You can see how the original variables behave relative to our principal components, which is sort of as we saw in the correlation matrix above. We only see in the directions of the first two principal components, however. In the case of the `iris`

data we can already see pretty clear clustering here.

The loadings calculated by `princomp`

are eigenvectors of the correlation (or covariance, your choice) matrix and stored in the `loadings`

of the results (`pc$loadings`

in this example). You may prefer to use singular value deomposition for your PCA, in which case you can check out `prcomp`

instead of `princomp`

.

Let’s start to plot in three dimensions. We’ll use the excellent `rgl`

package, which you can install with `install.packages("rgl")`

if you haven’t already. We’ll plot the scores along the first three principal components for each iris, and color by species.

library(rgl) plot3d(pc$scores[,1:3], col=iris$Species)

That plot will be interactive – click and drag to rotate, right click and drag or use the mouse wheel to zoom.

It doesn’t seem like there’s a pre-made function for this, but we can sort of hack together a 3D equivalent to the biplot by adding to our initial 3D plot. This looks reasonably decent:

text3d(pc$scores[,1:3],texts=rownames(iris)) text3d(pc$loadings[,1:3], texts=rownames(pc$loadings), col="red") coords <- NULL for (i in 1:nrow(pc$loadings)) { coords <- rbind(coords, rbind(c(0,0,0),pc$loadings[i,1:3])) } lines3d(coords, col="red", lwd=4)

You really need to interact with this plot to see how everything is laid out. It’s very much like the biplot from above, but the eigenvectors are drawn on the same axes as the data.

You may also be interested in doing some unsupervised clustering. There are a bunch of ways to do this. In this case we have a “correct” clustering – the three species in the data set – so we can see how close to correct we are. Here’s the popular *k*-means method:

set.seed(42) cl <- kmeans(iris[,1:4],3) iris$cluster <- as.factor(cl$cluster)

The random seed is set for reprodicibility and then we save the cluster assignments from *k*-means as a new column in the `iris`

data frame. We can take a look at how well this works, visually and by tabulation.

plot3d(pc$scores[,1:3], col=iris$cluster, main="k-means clusters") plot3d(pc$scores[,1:3], col=iris$Species, main="actual species")

with(iris, table(cluster, Species)) # Species # cluster versicolor virginica setosa # 1 48 14 0 # 2 2 36 0 # 3 0 0 50

So *k*-means got all the setosa’s perfectly but made some mistakes with the other two species, picking far too many flowers for its cluster 1.

You may want to do some sort of hierarchical clustering. Here’s one way. (See also the Quick-R page on clustering.)

di <- dist(iris[,1:4], method="euclidean") tree <- hclust(di, method="ward") iris$hcluster <- as.factor((cutree(tree, k=3)-2) %% 3 +1) # that modulo business just makes the coming table look nicer plot(tree, xlab="") rect.hclust(tree, k=3, border="red")

In this case, the result is very similar to the result from *k*-means, but just slightly better, catching all the versicolor:

with(iris, table(hcluster, Species)) # Species # hcluster versicolor virginica setosa # 1 50 14 0 # 2 0 36 0 # 3 0 0 50

Of course with any clustering, you should probably think about how your variables are scaled before you start applying clustering methods, which I’ve just neglected here.

There’s much more you can do, with many more options and alternative techniques. Have fun!

I really enjoyed this post! Thanks! It was interesting seeing the unsupervised clustering approaches.

Thanks!

Hello

Thank you, very clear ! but I can not plot in three dimensions my PCA generated with prcomp instead of princomp … Cab you help me ?

Hi Marjorie!

Sure – ‘prcomp’ returns an object with different names for things. The ‘scores’ from ‘princomp’ are equivalent to ‘x’ from ‘prcomp’, and the ‘loadings’ from ‘princomp’ are equivalent to the ‘rotation’ from ‘prcomp’. So if you’re storing the results into ‘pc’, with ‘prcomp’ you’ll have to switch ‘pc$scores’ (above) to ‘pc$x’ and ‘pc$loadings’ to ‘pc$rotation’. Does it work for you with those changes?

Can You help me in adding labels to scatterplot3d of PCA object. I expected the points should be labeled with their rownames. but I didn’t come through.

Here Is the code used:

> scatterplot3d(PCA_3d$x[,1:3],xlab=”Component 1″,main=”3D PCA”,ylab=”Component 2″, zlab=”Component 3)”,type=”h”,box=FALSE,pch=8)

> text(PCA_3d$x, labels = rownames(PCA_3d$x),pos = 2, offset = 0.5)

Thanks in advance!

Hi! I didn’t use `scatterplot3d` at all here, did I? And I didn’t use `prcomp` either, which it seems you’re using based on that `x`.

Upon inspection, it seems that `prcomp_result$x` does not have rownames, so you probably want to find your labels somewhere else. Also, shouldn’t you probably pass the same coordinates to `text()` as to `scatterplot3d`?

Good luck!

Hi,

Thank you for your post. This is very helpful. I did this for a bigger dataset (over a million points) and it works. However, when i plot a 3D equivalent to the biplot, my text and arrows disappear (more like it got stuck in the middle of the millions of points) which make make unable to view the text and arrows of the PC loadings. Please can you suggest how can i deal with this. Thanks!

Thanks Kyra! I think you’ll probably just want to not plot the millions of points, and especially not their labels. With so many points, I suspect not even zooming in or decreasing the size of the points will be sufficient to achieve what you’re looking for…

i made two separate plot, one to show the loadings and the other to show the classification. anyway, many thanks for the post and the reply 🙂

Hi,

Thanks for your post. I’m using R studio and when i get to this part in the ppal component analysis :

text3d(pc$loadings[,1:3], texts=rownames(pc$loadings), col=”red”)

the program fails and closes.

Do you know what this could be?

Thanks in advance.

Hi Gabriela! I used RStudio as well for all of this and it worked fine for me; I wonder if it’s an issue with the rgl implementation on your platform. The issue happens every time, I imagine (replicates on your machine)? Have you already tried updating to current versions of everything? I’m not sure what else to tell you; do you get any error messages or anything? Have you tried doing it in R (without RStudio)? That would isolate whether it’s a problem with RStudio. Hmm… that’s all I can think of right now.

HI again,

I would also like to create a 3d plot with 3 of my original variables and a plane that represents the 2 main ppal components in that space… If this possible?

Thanks

Sure, that should be possible (though I’ll leave the implementation to you) – your two main principal components will have some magnitudes in this 3-space, though no guarantees on whether it’ll be interesting… From the two vectors in that 3-space you can certainly induce a plane, though this will ignore the magnitudes to some extent. I’m curious why you want to do this…

Hi, this is very helpful! Thank you so much!

I was just wondering, in the first 3-d graph of the data, graphed on the axes of the first three principal components, you suggested coloring the points by species. Why are there only three colors in your graph if you have 6 species in your data?

When I performed the same plo3d function on my data (23 rows/categories) I seem to get 23 colors.

I just want to be sure that I am doing/thinking about the function and the graph in the correct way.

Thank you!

There are only 3 species in the iris data 🙂

Ah! Thanks Matthew – I didn’t see your response until after I posted. 🙂

I see! Thanks!

Hi Erin! There are only three species in this data set – setosa, versicolor, and virginica – so it’s working as expected. If you have 23 different levels in the factor you’re using to specify color, you’ll get 23 different colors – which is probably more colors than you want in the same plot! 🙂 Does that help?

It was a misunderstanding – it makes sense now, thank you!

Interesante análisis, me a ayudado mucho, ya que el ACP es algo complicado de entender

¡Gracias!

I am using SAS to do PCA. Could you please tell me how I can plot the loadings and the the classification

Hi Sara! I don’t really blog about SAS. You can download R for free here: http://www.r-project.org/

I followed your example and got the following error when using the plot3d(pc$scores[,1:3], col=iris$species) command: Error in col2rgb(colors) : invalid color name ‘setosa’

If I don’t use color, it plots just fine. Could you direct me in fixing this error?

Thanks

Hi Shannon! Looks like you’re right – something seems to have changed! Not sure where the issue is coming from; I think there are new versions of everything from MacOS to XQuartz to rgl to R since I wrote this. Anyway, that code had been relying on Species (a factor) being treated as an integer. You can make it work by wrapping iris$Species with as.integer().

Thanks!

hola aaron, disculpa has trabajado con cromatogramas? tengo algunas dudas me puedes ayudar?

모르겠어요. 죄송합니다!

Hi, It is really helpful. Thank you for posting it. I tried to plot the legend. But it didn’t work. Do you know how to do this?

Hi! It’s been a while since I wrote this; not sure – sorry!

its wonderful. Can I use this to generate a PCA for my SNPs data for a diploid plant species? If yes then what will be the data input format? Please suggest

Hi! You likely can, but I don’t know enough about SNPs data to be much help, I’m afraid!

Thanks,

But they are larger in number, about 23000 SNPs genotypes for 220 individuals.

Great Starter post!

cheers