Calculating quantiles for groups with dplyr::summarize and purrr::partial

01 Oct 2018

rstats / dplyr / purrr

Recently, I was trying to calculate the percentiles of a set of variables within a data set grouped by another variable. However, I quickly realized that this is not very straightforward when using dplyr’s summarize. Before I demonstrate, let’s load the libraries that we will need.

library(dplyr)
library(purrr)

If you don’t believe me when I say that it is not straightforward, go ahead and try to run the following block of code.

mtcars %>% 
  dplyr::group_by(cyl) %>% 
  dplyr::summarize(quants = quantile(mpg, probs = c(0.2, 0.5, 0.8)))

If you ran the code, you will see that it throws the following error:

Error in summarise_impl(.data, dots) : 
  Column `quants` must be length 1 (a summary value), not 3

This error is telling us that the call returns an object of length 3 (our three quantiles) when summarize expects a single value per group. A quick Google search turns up numerous Stack Overflow questions and answers about this. Most of these solutions revolve around using the do function to calculate the quantiles for each group. However, according to Hadley, do will eventually be “going away”. While there is no definite time frame for this, I try to use it as little as possible. The new recommended practice is a combination of tidyr::nest, dplyr::mutate and purrr::map for most cases of grouping. I love this approach for most things (and it is even the accepted answer for one of the SO questions mentioned above), but I worked up a new solution that I think is useful for calculating any desired number of percentiles across multiple groups.
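For comparison, here is a minimal sketch of what the nest/mutate/map approach might look like for this problem. It assumes tidyr and tibble are installed. The names nested_quants, quantile and the use of tibble::enframe are my own choices for illustration, not taken from the SO answers.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Percentiles to compute
p <- c(0.2, 0.5, 0.8)

nested_quants <- mtcars %>%
  group_by(cyl) %>%
  nest() %>%                               # one row per group, data in a list-column
  mutate(quants = map(
    data,
    # quantile() returns a named vector; enframe() turns it into a two-column tibble
    ~ tibble::enframe(quantile(.x$mpg, probs = p), name = "quantile", value = "mpg")
  )) %>%
  select(-data) %>%
  unnest(quants) %>%                       # back to a flat tibble, one row per group/quantile
  ungroup()

nested_quants
```

This returns the quantiles in long format (one row per group and percentile), so getting one column per percentile would take an extra reshaping step.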

This method uses purrr::map and a function operator, purrr::partial, to create a list of functions that can then be applied to a data set using dplyr::summarize_at and a little magic from rlang.
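If partial is new to you, the following small sketch shows the idea: it pre-fills some arguments of a function and returns a new function. The name median_fun is just an illustrative choice.

```r
library(purrr)

# partial() fixes probs and na.rm, returning a function that only needs the data
median_fun <- partial(quantile, probs = 0.5, na.rm = TRUE)

# These two calls should give the same result
from_partial <- median_fun(mtcars$mpg)
from_direct  <- quantile(mtcars$mpg, probs = 0.5, na.rm = TRUE)

from_partial
from_direct
```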

Let’s start by creating a vector of the desired percentiles to calculate. In this example, we will calculate the 20th, 50th, and 80th percentiles.

p <- c(0.2, 0.5, 0.8)

Now we can create a list of functions, with one for each quantile, using purrr::map and purrr::partial. We can also assign names to each function (useful for the output of summarize) using purrr::set_names.

p_names <- map_chr(p, ~paste0(.x*100, "%"))

p_funs <- map(p, ~partial(quantile, probs = .x, na.rm = TRUE)) %>% 
  set_names(nm = p_names)

p_funs

## $`20%`
## function (...) 
## quantile(probs = .x, na.rm = TRUE, ...)
## <environment: 0x7fcf50757430>
## 
## $`50%`
## function (...) 
## quantile(probs = .x, na.rm = TRUE, ...)
## <environment: 0x7fcf50762c30>
## 
## $`80%`
## function (...) 
## quantile(probs = .x, na.rm = TRUE, ...)
## <environment: 0x7fcf51148830>

Looking at p_funs, we can see that we have a named list in which each element is a function that calls quantile with probs already filled in. The beauty of this is that you can use this list in the same way you would supply multiple functions to summarize_at or summarize_all (e.g. funs(mean, sd)). The only difference is that we now have to splice the list in with the “bang-bang-bang” operator (!!!) from rlang (it is also exported from dplyr). The final product looks like this.

mtcars %>% 
  group_by(cyl) %>% 
  summarize_at(vars(mpg), funs(!!!p_funs))

## # A tibble: 3 x 4
##     cyl `20%` `50%` `80%`
##   <dbl> <dbl> <dbl> <dbl>
## 1     4  22.8  26    30.4
## 2     6  18.3  19.7  21  
## 3     8  13.9  15.2  16.8
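Note that funs() was deprecated in dplyr 0.8.0, so the call above will warn or fail on newer versions. As a sketch, assuming dplyr 1.0.0 or later, the same list of functions can be passed to across() instead; the .names = "{.fn}" argument is my choice here to reproduce the column names shown above.

```r
library(dplyr)
library(purrr)

# Rebuild the named list of partial quantile functions
p <- c(0.2, 0.5, 0.8)
p_funs <- map(p, ~ partial(quantile, probs = .x, na.rm = TRUE)) %>%
  set_names(paste0(p * 100, "%"))

# across() accepts a named list of functions directly, so no !!! is needed
res <- mtcars %>%
  group_by(cyl) %>%
  summarize(across(mpg, p_funs, .names = "{.fn}"))

res
```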

I think that this provides a pretty neat way to get the desired output in a format that does not require a large amount of post-calculation manipulation. It is also, in my opinion, more straightforward than a lot of the do methods. This method also allows quantiles to be calculated for more than one variable, although some post-processing would be necessary in that case. Here is an example.

mtcars %>% 
  group_by(cyl) %>% 
  summarize_at(vars(mpg, hp), funs(!!!p_funs)) %>% 
  select(cyl, contains("mpg"), contains("hp"))

## # A tibble: 3 x 7
##     cyl `mpg_20%` `mpg_50%` `mpg_80%` `hp_20%` `hp_50%` `hp_80%`
##   <dbl>     <dbl>     <dbl>     <dbl>    <dbl>    <dbl>    <dbl>
## 1     4      22.8      26        30.4       65      91        97
## 2     6      18.3      19.7      21        110     110       123
## 3     8      13.9      15.2      16.8      175     192.      245
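One possible form of that post-processing is reshaping the wide result so that the variable and the percentile become their own columns. The sketch below assumes dplyr 1.0.0 or later (for across()) and tidyr 1.0.0 or later (for pivot_longer()). The names wide, long, variable, quantile and value are illustrative choices.

```r
library(dplyr)
library(tidyr)
library(purrr)

p <- c(0.2, 0.5, 0.8)
p_funs <- map(p, ~ partial(quantile, probs = .x, na.rm = TRUE)) %>%
  set_names(paste0(p * 100, "%"))

# Wide result: columns are named <variable>_<percentile>, e.g. `mpg_20%`
wide <- mtcars %>%
  group_by(cyl) %>%
  summarize(across(c(mpg, hp), p_funs))

# Split the column names on "_" into a variable column and a quantile column
long <- wide %>%
  pivot_longer(
    -cyl,
    names_to = c("variable", "quantile"),
    names_sep = "_",
    values_to = "value"
  )

long
```

Splitting on "_" works here because neither the variable names nor the percentile labels contain an underscore; other column names would need a different separator or names_pattern.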

partial is yet another tool from the purrr package that can greatly enhance your R coding abilities. While this is surely a basic application of its functionality, one can easily see how powerful this function can be.