Recently, I was trying to calculate the percentiles of a set of variables within a data set grouped by another variable. However, I quickly realized that this is not very straightforward when using dplyr’s summarize. Before I demonstrate, let’s load the libraries that we will need.
library(dplyr)
library(purrr)
If you don’t believe me when I say that it is not straightforward, go ahead and try to run the following block of code.
mtcars %>%
  dplyr::group_by(cyl) %>%
  dplyr::summarize(quants = quantile(mpg, probs = c(0.2, 0.5, 0.8)))
If you ran the code, you will see that it throws the following error:
Error in summarise_impl(.data, dots) :
Column `quants` must be length 1 (a summary value), not 3
This error is telling us that the result is an object of length 3 (our three quantiles) when summarize expects a single value. A quick Google search turns up numerous Stack Overflow questions and answers about this. Most of these solutions revolve around using the do function to calculate the quantiles on each of the groups. However, according to Hadley, do will eventually be “going away”. While there is no definite time frame on this, I try to use it as little as possible. The new recommended practice is a combination of tidyr::nest, dplyr::mutate and purrr::map for most cases of grouping. I love this approach for most things (and it is even the accepted answer for one of the SO questions mentioned above), but I worked up a new solution that I think is useful for calculating percentiles on multiple groups for any desired number of percentiles.
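For reference, the nest-and-map pattern mentioned above might look roughly like the following sketch. It assumes tidyr is installed alongside dplyr and purrr, and the names nested_quants and quants are only illustrative; treat it as one possible shape for that pattern, not the exact code from any of the Stack Overflow answers.

```r
library(dplyr)
library(tidyr)
library(purrr)

nested_quants <- mtcars %>%
  group_by(cyl) %>%
  nest() %>%      # one row per cyl; the other columns move into a list-column `data`
  ungroup() %>%
  mutate(quants = map(data, ~ quantile(.x$mpg, probs = c(0.2, 0.5, 0.8))))

# Each element of `quants` is a named numeric vector holding the three quantiles
nested_quants$quants[[1]]
```

The quantiles end up in a list-column, so an extra unnesting or reshaping step is still needed to get one column per percentile.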
This method uses purrr::map and a function operator, purrr::partial, to create a list of functions that can then be applied to a data set using dplyr::summarize_at and a little magic from rlang.
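To make the idea of a function operator concrete, here is a minimal sketch of what partial does on its own: it takes a function, pre-fills some of its arguments, and returns a new function. The name q20 is hypothetical and used only for illustration.

```r
library(purrr)

# Pre-fill `probs` and `na.rm`; the returned function only needs the data
q20 <- partial(quantile, probs = 0.2, na.rm = TRUE)

# These two calls should give the same result
q20(mtcars$mpg)
quantile(mtcars$mpg, probs = 0.2, na.rm = TRUE)
```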
Let’s start by creating a vector of the desired percentiles to calculate. In this example, we will calculate the 20th, 50th, and 80th percentiles.
p <- c(0.2, 0.5, 0.8)
Now we can create a list of functions, with one for each quantile, using purrr::map and purrr::partial. We can also assign names to each function (useful for the output of summarize) using purrr::set_names.
p_names <- map_chr(p, ~paste0(.x*100, "%"))
p_funs <- map(p, ~partial(quantile, probs = .x, na.rm = TRUE)) %>%
  set_names(nm = p_names)
p_funs
## $`20%`
## function (...)
## quantile(probs = .x, na.rm = TRUE, ...)
## <environment: 0x7fcf50757430>
##
## $`50%`
## function (...)
## quantile(probs = .x, na.rm = TRUE, ...)
## <environment: 0x7fcf50762c30>
##
## $`80%`
## function (...)
## quantile(probs = .x, na.rm = TRUE, ...)
## <environment: 0x7fcf51148830>
Looking at p_funs, we can see that we have a named list in which each element is a function built from the quantile function. The beauty of this is that you can use this list in the same way you would define multiple functions in any other summarize_at or summarize_all call (e.g. funs(mean, sd)). The only difference is that we now have to use the “bang-bang-bang” operator (!!!) from rlang (it is also exported from dplyr). The final product looks like this.
mtcars %>%
  group_by(cyl) %>%
  summarize_at(vars(mpg), funs(!!!p_funs))
## # A tibble: 3 x 4
## cyl `20%` `50%` `80%`
## <dbl> <dbl> <dbl> <dbl>
## 1 4 22.8 26 30.4
## 2 6 18.3 19.7 21
## 3 8 13.9 15.2 16.8
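A note for readers on newer versions of dplyr: funs was deprecated in dplyr 0.8.0, and summarize_at was superseded by across in dplyr 1.0.0. The sketch below shows how the same list of functions could be passed to across instead. It assumes dplyr 1.0.0 or later; the .names = "{.fn}" argument is my choice here to reproduce the column names shown above, and res is just an illustrative name.

```r
library(dplyr)
library(purrr)

p <- c(0.2, 0.5, 0.8)
p_names <- map_chr(p, ~ paste0(.x * 100, "%"))
p_funs <- map(p, ~ partial(quantile, probs = .x, na.rm = TRUE)) %>%
  set_names(nm = p_names)

# across() accepts a named list of functions directly, so no !!! is needed
res <- mtcars %>%
  group_by(cyl) %>%
  summarize(across(mpg, p_funs, .names = "{.fn}"))

res
```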
I think that this provides a pretty neat way to get the desired output in a format that does not require a large amount of post-calculation manipulation. In addition, it is, in my opinion, more straightforward than a lot of the do methods. This method also allows for quantiles to be calculated for more than one variable, although post-processing would be necessary in that case. Here is an example.
mtcars %>%
  group_by(cyl) %>%
  summarize_at(vars(mpg, hp), funs(!!!p_funs)) %>%
  select(cyl, contains("mpg"), contains("hp"))
## # A tibble: 3 x 7
## cyl `mpg_20%` `mpg_50%` `mpg_80%` `hp_20%` `hp_50%` `hp_80%`
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 4 22.8 26 30.4 65 91 97
## 2 6 18.3 19.7 21 110 110 123
## 3 8 13.9 15.2 16.8 175 192. 245
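As one possible form of that post-processing, the wide result can be reshaped into a long table with one row per group, variable, and percentile. This is only a sketch under a few assumptions: it uses tidyr::pivot_longer (tidyr 1.0.0 or later) and across (dplyr 1.0.0 or later), it relies on the variable names containing no underscore so that names such as mpg_20% split cleanly, and the names long_quants, variable, percentile, and value are illustrative.

```r
library(dplyr)
library(tidyr)
library(purrr)

p <- c(0.2, 0.5, 0.8)
p_funs <- map(p, ~ partial(quantile, probs = .x, na.rm = TRUE)) %>%
  set_names(nm = paste0(p * 100, "%"))

long_quants <- mtcars %>%
  group_by(cyl) %>%
  summarize(across(c(mpg, hp), p_funs)) %>%  # columns such as `mpg_20%` and `hp_80%`
  pivot_longer(
    -cyl,
    names_to = c("variable", "percentile"),  # split each column name at the underscore
    names_sep = "_",
    values_to = "value"
  )

long_quants
```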
partial is yet another tool from the purrr package that can greatly enhance your R coding abilities. While this is surely a basic application of its functionality, one can easily see how powerful this function can be.