The other day, a question was posted on RStudio Community about performing Principal Component Analysis (PCA) in a tidyverse workflow. Conveniently, I had literally just worked through this process the day before and was able to post an answer. While most questions and answers are good as they are on forum sites, I thought this one might be worth exploring a little more since using the tidyverse framework makes PCA much easier, in my opinion.
PCA is a multi-variate statistical technique for dimension reduction. Essentially, it allows you to take a data set that has n continuous variables and relate them through n orthogonal dimensions. This is a method of unsupervised learning that allows you to better understand the variability in the data set and how different variables are related. While there are the same number of principal components created as there are variables (assuming you have more observations than variables-but that is another issue), each principal component explains the maximum possible variation in the data conditional on it being orthogonal, or perpendicular, to the previous principal components.
In my answer, I used the
iris data set to demonstrate how PCA can be done in the tidyverse workflow. For this post, I will be using the
USArrests data set that was used in An Introduction to Statistical Thinking by Gareth James et. al. In this book, they work through a PCA and focus on the statistics and explanations behind PCA. This is how I learned how to do PCA and would highly recommend it if you are unfamiliar with the topic. In this blog post, my focus will be more on implementing the PCA in the tidyverse framework. Another nice walkthough of PCA with this dataset that is online can be found at University of Cincinnati’s R blog.
Exploring the data
Before we dive in to the analysis, we want to explore our data set and become familiar with it.
library(tidyverse) library(broom) library(knitr) library(ggfortify)
USArrests %>% head() %>% knitr::kable()
Looking at the first 6 rows using the
head() function, we can see that each row is a state and and each column is a variable. Looking at
?USArrests, we can see that the column descriptions are as follows:
- Murder - the number of murder arrests per 100,000 people in a given state
- Assault - the number of assault arrests per 100,000 people in a given state
- UrbanPop - a numeric percentage of the urban population per state (i.e. the percentage of the state’s population that lives in cities)
- Rape - the number of rape arrests per 100,000 people in a given state
Now, that we see how the data set is set up, let’s try to visualize the data as it is. Before we do that, let’s convert the data set to a
tibble. Since tibbles do not support rownames, we will have to convert them to their own column with
us_arrests <- USArrests %>% rownames_to_column(var = "state") %>% # I prefer column names to be all lowercase so I am going to change them here rename_all(tolower) %>% as_tibble() us_arrests
## # A tibble: 50 x 5 ## state murder assault urbanpop rape ## <chr> <dbl> <int> <int> <dbl> ## 1 Alabama 13.2 236 58 21.2 ## 2 Alaska 10.0 263 48 44.5 ## 3 Arizona 8.10 294 80 31.0 ## 4 Arkansas 8.80 190 50 19.5 ## 5 California 9.00 276 91 40.6 ## 6 Colorado 7.90 204 78 38.7 ## 7 Connecticut 3.30 110 77 11.1 ## 8 Delaware 5.90 238 72 15.8 ## 9 Florida 15.4 335 80 31.9 ## 10 Georgia 17.4 211 60 25.8 ## # ... with 40 more rows
Let’s take a look at murder arrest rates in each of the stats.
us_arrests %>% mutate(state = factor(state), state = fct_reorder(state, murder) %>% fct_rev()) %>% ggplot(aes(state, murder)) + geom_bar(stat = "identity") + theme_bw() + theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.4)) + labs(y = "Murder Arrest Rate per 100,000 people", x = "State", title = "Murder rate in each state of the USA")
You can see that Georgia has the highest murder rate, followed by Mississippi, Louisiana, and Florida Let’s try one more.
us_arrests %>% gather(key = crime, value = rate, c(murder, assault, rape)) %>% ggplot(aes(urbanpop, rate, color = crime)) + facet_wrap(~crime, scales = "free_y", ncol = 1) + geom_point() + geom_smooth(se = FALSE, method = "lm") + theme_bw() + labs(x = "Percentage Urban Population", y = "Arrest Rate per 100,000 people", title = "Arrest rate vs percentage urban population") + theme(legend.title = element_blank(), legend.position = "bottom")
This figure is a slightly more informative than the last one. In this figure, each crime rate is plotted against percentage urban population. Simple linear models are fit for each of the different crime types to see if any pattern can be seen in the data. There appears to be an increase in assault arrests as urban population grows, however, there is a lot of variability around the line of best fit. Rape arrest rates follow the linear model much closer than the others but there is still a lot of variability. On the other hand, murder arrest rates seem to be unchanged based on urban population.
Now we could certainly do correlations, multiple linear regressions, or fit other types of models and would likely gain some useful insights, but instead let’s focus on the PCA.
PCA in the tidyverse framework
Now, when I first fit PCA models in R, I found myself with an unmanageable number of variables and to track and maintain. This can make the process overwhelming and can make you lose track of information. Luckily, using the
tidyverse and the
broom package, we can solve these issues much more easily. In order to run the model in the tidyverse framework, we will use the
nest() function along with the
map combo to operate on the nested columns.
us_arrests_pca <- us_arrests %>% nest() %>% mutate(pca = map(data, ~ prcomp(.x %>% select(-state), center = TRUE, scale = TRUE)), pca_aug = map2(pca, data, ~augment(.x, data = .y))) us_arrests_pca
## # A tibble: 1 x 3 ## data pca pca_aug ## <list> <list> <list> ## 1 <tibble [50 x 5]> <S3: prcomp> <data.frame [50 x 10]>
As you can see, we now have a tibble with one row and three columns. The first is our original data set, the second is the
prcomp object and the third is a data frame containing the principal component values for each observation. Now we have everything we need to evaluate the results of the model. First, it is important to look at how much variance is explained by each principal component. This will tell you how many of the components you need to look at when analyzing the results. To do this, you can use the data in the
pca_aug column of our
us_arrests_pca tibble along with some
var_exp <- us_arrests_pca %>% unnest(pca_aug) %>% summarize_at(.vars = vars(contains("PC")), .funs = funs(var)) %>% gather(key = pc, value = variance) %>% mutate(var_exp = variance/sum(variance), cum_var_exp = cumsum(var_exp), pc = str_replace(pc, ".fitted", "")) var_exp
## # A tibble: 4 x 4 ## pc variance var_exp cum_var_exp ## <chr> <dbl> <dbl> <dbl> ## 1 PC1 2.48 0.620 0.620 ## 2 PC2 0.990 0.247 0.868 ## 3 PC3 0.357 0.0891 0.957 ## 4 PC4 0.173 0.0434 1.00
We can see that the first principal component explains 62% of the variation and that the second principal component explains 25% of the remaining variation. Together, we can see that they explain 87% of the variance in the data set. As a general rule of thumb, you want to look at enough principal components to explain ~90% of your data’s variability.
Another way of assessing this is to plot the variance explained and the cumulative variance explained and look for the “elbow” in the graph. You can do that like this:
var_exp %>% rename( `Variance Explained` = var_exp, `Cumulative Variance Explained` = cum_var_exp ) %>% gather(key = key, value = value, `Variance Explained`:`Cumulative Variance Explained`) %>% ggplot(aes(pc, value, group = key)) + geom_point() + geom_line() + facet_wrap(~key, scales = "free_y") + theme_bw() + lims(y = c(0, 1)) + labs(y = "Variance", title = "Variance explained by each principal component")
Since the majority of the variance is explained by the first two principal components, let’s plot them against each other. Luckily, this is made easy by the
ggfortify packages which gives an
autoplot method for prcomp objects. Conveniently, we still have our
prcomp object stored in the our
us_arrests_pca tibble along with our original data. This is important as the original data is needed to add labels and/or colors to your ggplot based on discrete variables not included in the PCA. Using the same
map combo as before, along with the handy
dplyr::pull function, we can plot this graph.
autoplot.prcomp() can take any arguments that can be passed to
ggbiplot(), so to see all of your options use
us_arrests_pca %>% mutate( pca_graph = map2( .x = pca, .y = data, ~ autoplot(.x, loadings = TRUE, loadings.label = TRUE, loadings.label.repel = TRUE, data = .y, label = TRUE, label.label = "state", label.repel = TRUE) + theme_bw() + labs(x = "Principal Component 1", y = "Principal Component 2", title = "First two principal components of PCA on USArrests dataset") ) ) %>% pull(pca_graph)
If you notice the
[] above the graph, this method would automatically print each graph in the
pca_graph column. This means that if you fit multiple pca graphs onto a grouped data set then this would automatically print all of the figures for you in one command. Another important thing to notice is the the
autoplot() function simply returns a
gg object and that can be extended just like any other
Looking at this figure, we are getting a lot more information about how these variables are related. For example, we can see that murder, rape, and assault arrest rates all have similar values alone the first principal component, which indicates that they are correlated with one another. We may have been able to figure that out using more basic methods, but this method allows us to see it in one figure. In addition, we are able to see how each of the observations in the data (i.e. states) relates to the variables. For example, California is characterized by high arrest rates (1st principal component) and high urban population (2nd principal component).
PCA can provide an analysis with a lot of information about their data set and can give valuable insights into potentially unseen relationships between the observations and the variables. By performing this analysis in the tidyverse framework, we can easily extend this model by using the modelling capabilities of the tidyverse. You can see some of these here and here. I believe that this framework allows for much more flexibility in your analysis and how you use the data from the model, which will enable you to gather more from your data in a faster and convenient way.
Please let me know what you think of performing PCA in the tidyverse framework!