Using Scoped dplyr verbs

Introduction

Over the past several months, I have been using scoped dplyr verbs more and more. For those of you who don’t know these functions, they are handy variants of the normal dplyr verbs, such as filter, mutate, and summarize, that allow you to target multiple columns or all of your columns at once. They save you time and typing whenever you want to apply one or more functions to several columns, a group of columns, or every column in your data set. This post will walk through a few of the ones I use on a regular basis and how I use them!

These scoped verbs typically come in three different flavors:

  • _if - This allows you to target all columns that meet a specific condition
  • _at - This allows you to target specific columns by name
  • _all - As the name implies, this will apply a function to every column of the data set

Before we get started, let’s go ahead and load the libraries we will be using.

library(dplyr)
library(ggplot2)
library(tibble)
library(stringr)
library(gt) # for the sp500 dataset
library(janitor)
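
To give you a quick feel for the difference before diving in, here is a minimal sketch of the three flavors using the diamonds data set from ggplot2 (covered in detail below); the columns and summary functions here are chosen purely for illustration.

# _if: target the columns that satisfy a predicate (here, every numeric column)
diamonds %>% summarize_if(is.numeric, mean)

# _at: target specific columns by name inside vars()
diamonds %>% summarize_at(vars(carat, depth, price), mean)

# _all: target every column in the data set
diamonds %>% summarize_all(n_distinct)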

_if

Let’s first take a look at mutate_if with an example where we want to convert factor columns to character. The data set we will be using for this example is diamonds from the ggplot2 package.

diamonds
## # A tibble: 53,940 x 10
##    carat cut       color clarity depth table price     x     y     z
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
##  2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
##  3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
##  4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
##  5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
##  6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
##  7 0.24  Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
##  8 0.26  Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
##  9 0.22  Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
## 10 0.23  Very Good H     VS1      59.4    61   338  4     4.05  2.39
## # ... with 53,930 more rows

As we can see, there are three factor columns in the data set (<ord> indicates an ordered factor) - cut, color, and clarity. If you weren’t using scoped verbs, you would convert them with something like this.

diamonds %>% 
  mutate(
    cut = as.character(cut),
    color = as.character(color),
    clarity = as.character(clarity)
  )
## # A tibble: 53,940 x 10
##    carat cut       color clarity depth table price     x     y     z
##    <dbl> <chr>     <chr> <chr>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
##  2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
##  3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
##  4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
##  5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
##  6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
##  7 0.24  Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
##  8 0.26  Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
##  9 0.22  Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
## 10 0.23  Very Good H     VS1      59.4    61   338  4     4.05  2.39
## # ... with 53,930 more rows

While this certainly works, it is easy to see how this approach can get out of hand rather quickly. The scoped variant is much cleaner: you pass a predicate function that returns TRUE or FALSE for each column (e.g. is.factor), and mutate_if applies the function (e.g. as.character) to every column for which the predicate returns TRUE.

diamonds %>% 
  mutate_if(is.factor, as.character)
## # A tibble: 53,940 x 10
##    carat cut       color clarity depth table price     x     y     z
##    <dbl> <chr>     <chr> <chr>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
##  2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
##  3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
##  4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
##  5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
##  6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
##  7 0.24  Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
##  8 0.26  Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
##  9 0.22  Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
## 10 0.23  Very Good H     VS1      59.4    61   338  4     4.05  2.39
## # ... with 53,930 more rows

Converting factors (and other data types) is probably what I use mutate_if for the most, but you can use it for anything that has to be applied to all columns meeting a certain condition. If we wanted to add 10% of the mean of each numeric column to every value in that column (not practical, but a useful illustration), we could do the following.

diamonds %>% 
  mutate_if(is.numeric, list(~.+0.1*mean(., na.rm = TRUE)))
## # A tibble: 53,940 x 10
##    carat cut       color clarity depth table price     x     y     z
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 0.310 Ideal     E     SI2      67.7  60.7  719.  4.52  4.55  2.78
##  2 0.290 Premium   E     SI1      66.0  66.7  719.  4.46  4.41  2.66
##  3 0.310 Good      E     VS1      63.1  70.7  720.  4.62  4.64  2.66
##  4 0.370 Premium   I     VS2      68.6  63.7  727.  4.77  4.80  2.98
##  5 0.390 Good      J     SI2      69.5  63.7  728.  4.91  4.92  3.10
##  6 0.320 Very Good J     VVS2     69.0  62.7  729.  4.51  4.53  2.83
##  7 0.320 Very Good I     VVS1     68.5  62.7  729.  4.52  4.55  2.82
##  8 0.340 Very Good H     SI1      68.1  60.7  730.  4.64  4.68  2.88
##  9 0.300 Fair      E     VS2      71.3  66.7  730.  4.44  4.35  2.84
## 10 0.310 Very Good H     VS1      65.6  66.7  731.  4.57  4.62  2.74
## # ... with 53,930 more rows

NOTE: The notation for passing anonymous functions (such as above) or multiple functions changed with the release of dplyr 0.8.0. Previously, the above code would read mutate_if(is.numeric, funs(. + 0.1 * mean(., na.rm = TRUE))). The funs function has been soft-deprecated in the new release, meaning it can still be used for now, but you should switch to the newer syntax since funs will eventually be unsupported or removed. Running the mutate_if call with funs results in the following warning message:

Warning message:
funs() is soft deprecated as of dplyr 0.8.0
please use list() instead
# Before:
funs(name = f(.))
# After: 
list(name = ~f(.))
This warning is displayed once per session.
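
For the mutate_if call above, the old and new forms look like this side by side (both produce the same result; the funs version just triggers the warning):

# Old, soft-deprecated funs() syntax
diamonds %>% 
  mutate_if(is.numeric, funs(. + 0.1 * mean(., na.rm = TRUE)))

# New syntax as of dplyr 0.8.0, using list() and a ~ lambda
diamonds %>% 
  mutate_if(is.numeric, list(~ . + 0.1 * mean(., na.rm = TRUE)))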

In this example, the original columns are modified in place. If you instead want to create new columns for all of the columns the predicate applies to, you can give the function a name in the list. That name is appended to the name of every column it applies to, with a _ as a separator.

diamonds %>% 
  mutate_if(is.numeric, list("new" = ~.+0.1*mean(., na.rm = TRUE)))
## # A tibble: 53,940 x 17
##    carat cut   color clarity depth table price     x     y     z carat_new
##    <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>     <dbl>
##  1 0.23  Ideal E     SI2      61.5    55   326  3.95  3.98  2.43     0.310
##  2 0.21  Prem~ E     SI1      59.8    61   326  3.89  3.84  2.31     0.290
##  3 0.23  Good  E     VS1      56.9    65   327  4.05  4.07  2.31     0.310
##  4 0.290 Prem~ I     VS2      62.4    58   334  4.2   4.23  2.63     0.370
##  5 0.31  Good  J     SI2      63.3    58   335  4.34  4.35  2.75     0.390
##  6 0.24  Very~ J     VVS2     62.8    57   336  3.94  3.96  2.48     0.320
##  7 0.24  Very~ I     VVS1     62.3    57   336  3.95  3.98  2.47     0.320
##  8 0.26  Very~ H     SI1      61.9    55   337  4.07  4.11  2.53     0.340
##  9 0.22  Fair  E     VS2      65.1    61   337  3.87  3.78  2.49     0.300
## 10 0.23  Very~ H     VS1      59.4    61   338  4     4.05  2.39     0.310
## # ... with 53,930 more rows, and 6 more variables: depth_new <dbl>,
## #   table_new <dbl>, price_new <dbl>, x_new <dbl>, y_new <dbl>,
## #   z_new <dbl>

Similar to mutate_if, summarize_if/summarise_if lets you select all the columns that meet a certain condition and summarize those columns with a given function. It should be noted that, just as with summarize, the function you provide has to return a single value. If you would like to see a way around that requirement, using purrr and scoped verbs, you can see here.

Let’s see how summarize_if works:

diamonds %>% 
  summarize_if(is.numeric, list("mean" = mean, "median" = median))
## # A tibble: 1 x 14
##   carat_mean depth_mean table_mean price_mean x_mean y_mean z_mean
##        <dbl>      <dbl>      <dbl>      <dbl>  <dbl>  <dbl>  <dbl>
## 1      0.798       61.7       57.5      3933.   5.73   5.73   3.54
## # ... with 7 more variables: carat_median <dbl>, depth_median <dbl>,
## #   table_median <dbl>, price_median <dbl>, x_median <dbl>,
## #   y_median <dbl>, z_median <dbl>
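
As a hedged aside (one possible workaround, not necessarily the approach in the post linked above), you can sidestep the single-value requirement by wrapping the result in list(), so that each summary becomes a list-column:

diamonds %>% 
  summarize_if(is.numeric, list("range" = ~list(range(.))))

Each *_range column is then a list-column holding the length-two vector (minimum and maximum) returned by range().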

We can also create more custom predicates to be used for our _if functions. Let’s create one that returns whether a column is numeric and has at least one value higher than 50.

higher_fifty <- function(x){
  if (is.numeric(x)){
    return(any(x > 50))
  } else {
    return(FALSE)
  }
}


diamonds %>% 
  summarize_if(higher_fifty, list("mean" = mean))
## # A tibble: 1 x 4
##   depth_mean table_mean price_mean y_mean
##        <dbl>      <dbl>      <dbl>  <dbl>
## 1       61.7       57.5      3933.   5.73

Another useful _if variant is select_if. You may want to select all of the numeric columns in the diamonds data set for further analysis, and select_if is perfect for this case. The implementation is the same as mutate_if and summarize_if, except that instead of supplying both a predicate and a function to apply to the selected columns, you only supply the predicate.

diamonds %>% 
  select_if(is.numeric)
## # A tibble: 53,940 x 7
##    carat depth table price     x     y     z
##    <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1 0.23   61.5    55   326  3.95  3.98  2.43
##  2 0.21   59.8    61   326  3.89  3.84  2.31
##  3 0.23   56.9    65   327  4.05  4.07  2.31
##  4 0.290  62.4    58   334  4.2   4.23  2.63
##  5 0.31   63.3    58   335  4.34  4.35  2.75
##  6 0.24   62.8    57   336  3.94  3.96  2.48
##  7 0.24   62.3    57   336  3.95  3.98  2.47
##  8 0.26   61.9    55   337  4.07  4.11  2.53
##  9 0.22   65.1    61   337  3.87  3.78  2.49
## 10 0.23   59.4    61   338  4     4.05  2.39
## # ... with 53,930 more rows
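
Since the .predicate argument works the same way across the _if verbs, the custom higher_fifty predicate defined above (or an equivalent inline purrr-style lambda) can be reused here as well - a quick sketch, which selects the same four columns (depth, table, price, and y) that showed up in the summary earlier:

# Reusing the custom predicate from above
diamonds %>% 
  select_if(higher_fifty)

# Or an equivalent inline lambda
diamonds %>% 
  select_if(~ is.numeric(.) && any(. > 50))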

The last _if variant we will look at is filter_if. It is slightly different from the rest, because in addition to selecting columns based on a condition, it also filters the rows of the data frame based on a condition across those columns. This row-wise condition is supplied through the .vars_predicate argument of the scoped filter verbs, in conjunction with the helper functions all_vars and any_vars.

For this example, we will use the sp500 data set from the gt package. Let’s say we want to keep only the rows where every numeric value is greater than 2,000 - in other words, days where the index stayed above $2,000 all day. We can do that like this.

gt::sp500 %>% 
  filter_if(is.numeric, all_vars(. > 2000))
## # A tibble: 249 x 7
##    date        open  high   low close     volume adj_close
##    <date>     <dbl> <dbl> <dbl> <dbl>      <dbl>     <dbl>
##  1 2015-12-31 2061. 2063. 2044. 2044. 2655330000     2044.
##  2 2015-12-30 2077. 2077. 2062. 2063. 2367430000     2063.
##  3 2015-12-29 2061. 2082. 2061. 2078. 2542000000     2078.
##  4 2015-12-28 2058. 2058. 2044. 2056. 2492510000     2056.
##  5 2015-12-24 2064. 2067. 2059. 2061. 1411860000     2061.
##  6 2015-12-23 2042. 2065. 2042. 2064. 3484090000     2064.
##  7 2015-12-22 2023. 2043. 2020. 2039. 3520860000     2039.
##  8 2015-12-21 2010. 2023. 2006. 2021. 3760280000     2021.
##  9 2015-12-18 2041. 2041. 2005. 2006. 6683070000     2006.
## 10 2015-12-17 2074. 2076. 2042. 2042. 4327390000     2042.
## # ... with 239 more rows

all_vars requires that every column selected by the predicate meets the filter condition. Conversely, any_vars only requires that at least one of the selected columns meets it.
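
For contrast, here is a hedged sketch using any_vars. Since only the volume column in sp500 reaches into the billions, this keeps the days on which at least one numeric column (in practice, volume) exceeds 5 billion:

gt::sp500 %>% 
  filter_if(is.numeric, any_vars(. > 5e9))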

_at

Now we can take a look at another variant of the dplyr verbs that allows us to target specific columns: _at. These functions are super handy when you want to apply a function to numerous columns by name. For this example, let’s use the ever-useful mtcars data set with the row names moved to a column named car. We will assign this modified tibble to cars_tbl.

cars_tbl <- mtcars %>% 
  rownames_to_column("car") %>% 
  as_tibble()

Let’s say that we want to normalize the mpg, hp, and drat columns from zero to one. We can do that by writing a simple function and applying it to each column, like this.

norm_dat <- function(x){
  (x-min(x))/(max(x)-min(x))
}

cars_tbl %>% 
  mutate(
    mpg = norm_dat(mpg),
    hp = norm_dat(hp),
    drat = norm_dat(drat)
  )
## # A tibble: 32 x 12
##    car     mpg   cyl  disp     hp  drat    wt  qsec    vs    am  gear  carb
##    <chr> <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 Mazd~ 0.451     6  160  0.205  0.525  2.62  16.5     0     1     4     4
##  2 Mazd~ 0.451     6  160  0.205  0.525  2.88  17.0     0     1     4     4
##  3 Dats~ 0.528     4  108  0.145  0.502  2.32  18.6     1     1     4     1
##  4 Horn~ 0.468     6  258  0.205  0.147  3.22  19.4     1     0     3     1
##  5 Horn~ 0.353     8  360  0.435  0.180  3.44  17.0     0     0     3     2
##  6 Vali~ 0.328     6  225  0.187  0      3.46  20.2     1     0     3     1
##  7 Dust~ 0.166     8  360  0.682  0.207  3.57  15.8     0     0     3     4
##  8 Merc~ 0.596     4  147. 0.0353 0.429  3.19  20       1     0     4     2
##  9 Merc~ 0.528     4  141. 0.152  0.535  3.15  22.9     1     0     4     2
## 10 Merc~ 0.374     6  168. 0.251  0.535  3.44  18.3     1     0     4     4
## # ... with 22 more rows

Just like with the previous example of converting factors to characters, this certainly works but can quickly become cumbersome when applying the same function to multiple columns in a data set. mutate_at allows you to specifically target columns to apply a function to.

cars_tbl %>% 
  mutate_at(vars(mpg, hp, drat), list(~norm_dat(.)))
## # A tibble: 32 x 12
##    car     mpg   cyl  disp     hp  drat    wt  qsec    vs    am  gear  carb
##    <chr> <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 Mazd~ 0.451     6  160  0.205  0.525  2.62  16.5     0     1     4     4
##  2 Mazd~ 0.451     6  160  0.205  0.525  2.88  17.0     0     1     4     4
##  3 Dats~ 0.528     4  108  0.145  0.502  2.32  18.6     1     1     4     1
##  4 Horn~ 0.468     6  258  0.205  0.147  3.22  19.4     1     0     3     1
##  5 Horn~ 0.353     8  360  0.435  0.180  3.44  17.0     0     0     3     2
##  6 Vali~ 0.328     6  225  0.187  0      3.46  20.2     1     0     3     1
##  7 Dust~ 0.166     8  360  0.682  0.207  3.57  15.8     0     0     3     4
##  8 Merc~ 0.596     4  147. 0.0353 0.429  3.19  20       1     0     4     2
##  9 Merc~ 0.528     4  141. 0.152  0.535  3.15  22.9     1     0     4     2
## 10 Merc~ 0.374     6  168. 0.251  0.535  3.44  18.3     1     0     4     4
## # ... with 22 more rows

We can see how handy this can become and how much time it can save you when you are repeating the same operation on numerous columns within a tibble. A second, but equally advantageous, use of mutate_at is the deselection of columns to which a function should be applied. Say that we want to apply our normalization function to every column except car (the character column) and the binary columns vs and am. To do this we use the same approach as we would to remove a column with select.

cars_tbl %>% 
  mutate_at(vars(-c(car, vs, am)), list(~norm_dat(.)))
## # A tibble: 32 x 12
##    car     mpg   cyl   disp     hp  drat    wt  qsec    vs    am  gear
##    <chr> <dbl> <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 Mazd~ 0.451   0.5 0.222  0.205  0.525 0.283 0.233     0     1   0.5
##  2 Mazd~ 0.451   0.5 0.222  0.205  0.525 0.348 0.3       0     1   0.5
##  3 Dats~ 0.528   0   0.0920 0.145  0.502 0.206 0.489     1     1   0.5
##  4 Horn~ 0.468   0.5 0.466  0.205  0.147 0.435 0.588     1     0   0  
##  5 Horn~ 0.353   1   0.721  0.435  0.180 0.493 0.3       0     0   0  
##  6 Vali~ 0.328   0.5 0.384  0.187  0     0.498 0.681     1     0   0  
##  7 Dust~ 0.166   1   0.721  0.682  0.207 0.526 0.160     0     0   0  
##  8 Merc~ 0.596   0   0.189  0.0353 0.429 0.429 0.655     1     0   0.5
##  9 Merc~ 0.528   0   0.174  0.152  0.535 0.419 1         1     0   0.5
## 10 Merc~ 0.374   0.5 0.241  0.251  0.535 0.493 0.452     1     0   0.5
## # ... with 22 more rows, and 1 more variable: carb <dbl>

Now let’s say that we want to get the mean, sd, median, and count of all values greater than the mean for the mpg, hp, and drat columns. We can do that using the summarize_at function.

cars_summary <- cars_tbl %>% 
  summarize_at(vars(mpg, hp, drat), list("mean" = mean, "sd" = sd, "median" = median, "n_higher_half" = ~sum(. > mean(.))))

cars_summary
## # A tibble: 1 x 12
##   mpg_mean hp_mean drat_mean mpg_sd hp_sd drat_sd mpg_median hp_median
##      <dbl>   <dbl>     <dbl>  <dbl> <dbl>   <dbl>      <dbl>     <dbl>
## 1     20.1    147.      3.60   6.03  68.6   0.535       19.2       123
## # ... with 4 more variables: drat_median <dbl>, mpg_n_higher_half <int>,
## #   hp_n_higher_half <int>, drat_n_higher_half <int>

The above example demonstrates how you can pass a bare function name to be applied to each column, and how you can pass slightly more complex expressions to the .funs argument using the ~ and . notation.
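
One more note on targeting columns (an aside, not part of the original example): vars() accepts the same helpers as select(), so you can pick columns by pattern instead of listing them out. For instance, this would summarize every column of cars_tbl whose name starts with "d" (disp and drat):

cars_tbl %>% 
  summarize_at(vars(starts_with("d")), list("mean" = mean))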

_all

The _all variant works similarly to the other two, except that by default it targets every column in the data frame. This can be extremely useful if you want to perform transformations on all of the columns in your data set or get summary statistics for all of them - for example, when you are fitting multivariate models and want to normalize all of your variables. Let’s show how these functions work with a quick example of both mutate_all and summarize_all.

First, if we want to normalize all of the columns in a data set, we can do that by applying the norm_dat function defined above with mutate_all. Let’s see what this looks like using the USArrests data set (this works here because every column of USArrests is numeric).

USArrests %>%
  mutate_all(norm_dat)
##        Murder     Assault  UrbanPop       Rape
## 1  0.74698795 0.654109589 0.4406780 0.35917313
## 2  0.55421687 0.746575342 0.2711864 0.96124031
## 3  0.43975904 0.852739726 0.8135593 0.61240310
## 4  0.48192771 0.496575342 0.3050847 0.31524548
## 5  0.49397590 0.791095890 1.0000000 0.86046512
## 6  0.42771084 0.544520548 0.7796610 0.81136951
## 7  0.15060241 0.222602740 0.7627119 0.09819121
## 8  0.30722892 0.660958904 0.6779661 0.21963824
## 9  0.87951807 0.993150685 0.8135593 0.63565891
## 10 1.00000000 0.568493151 0.4745763 0.47803618
## 11 0.27108434 0.003424658 0.8644068 0.33333333
## 12 0.10843373 0.256849315 0.3728814 0.17829457
## 13 0.57831325 0.698630137 0.8644068 0.43152455
## 14 0.38554217 0.232876712 0.5593220 0.35400517
## 15 0.08433735 0.037671233 0.4237288 0.10335917
## 16 0.31325301 0.239726027 0.5762712 0.27648579
## 17 0.53614458 0.219178082 0.3389831 0.23255814
## 18 0.87951807 0.698630137 0.5762712 0.38501292
## 19 0.07831325 0.130136986 0.3220339 0.01291990
## 20 0.63253012 0.873287671 0.5932203 0.52971576
## 21 0.21686747 0.356164384 0.8983051 0.23255814
## 22 0.68072289 0.719178082 0.7118644 0.71834625
## 23 0.11445783 0.092465753 0.5762712 0.19638243
## 24 0.92168675 0.732876712 0.2033898 0.25322997
## 25 0.49397590 0.455479452 0.6440678 0.54005168
## 26 0.31325301 0.219178082 0.3559322 0.23514212
## 27 0.21084337 0.195205479 0.5084746 0.23772610
## 28 0.68674699 0.708904110 0.8305085 1.00000000
## 29 0.07831325 0.041095890 0.4067797 0.05684755
## 30 0.39759036 0.390410959 0.9661017 0.29715762
## 31 0.63855422 0.821917808 0.6440678 0.64082687
## 32 0.62048193 0.715753425 0.9152542 0.48578811
## 33 0.73493976 1.000000000 0.2203390 0.22739018
## 34 0.00000000 0.000000000 0.2033898 0.00000000
## 35 0.39156627 0.256849315 0.7288136 0.36434109
## 36 0.34939759 0.363013699 0.6101695 0.32816537
## 37 0.24698795 0.390410959 0.5932203 0.56847545
## 38 0.33132530 0.208904110 0.6779661 0.19638243
## 39 0.15662651 0.441780822 0.9322034 0.02583979
## 40 0.81927711 0.801369863 0.2711864 0.39276486
## 41 0.18072289 0.140410959 0.2203390 0.14211886
## 42 0.74698795 0.489726027 0.4576271 0.50645995
## 43 0.71686747 0.534246575 0.8135593 0.47028424
## 44 0.14457831 0.256849315 0.8135593 0.40310078
## 45 0.08433735 0.010273973 0.0000000 0.10077519
## 46 0.46385542 0.380136986 0.5254237 0.34625323
## 47 0.19277108 0.342465753 0.6949153 0.48837209
## 48 0.29518072 0.123287671 0.1186441 0.05167959
## 49 0.10843373 0.027397260 0.5762712 0.09043928
## 50 0.36144578 0.397260274 0.4745763 0.21447028

It is that easy! We can also create new names for the mutated columns in the same manner that was shown in the _if section.

USArrests %>% 
  mutate_all(list("norm" = norm_dat))
##    Murder Assault UrbanPop Rape Murder_norm Assault_norm UrbanPop_norm
## 1    13.2     236       58 21.2  0.74698795  0.654109589     0.4406780
## 2    10.0     263       48 44.5  0.55421687  0.746575342     0.2711864
## 3     8.1     294       80 31.0  0.43975904  0.852739726     0.8135593
## 4     8.8     190       50 19.5  0.48192771  0.496575342     0.3050847
## 5     9.0     276       91 40.6  0.49397590  0.791095890     1.0000000
## 6     7.9     204       78 38.7  0.42771084  0.544520548     0.7796610
## 7     3.3     110       77 11.1  0.15060241  0.222602740     0.7627119
## 8     5.9     238       72 15.8  0.30722892  0.660958904     0.6779661
## 9    15.4     335       80 31.9  0.87951807  0.993150685     0.8135593
## 10   17.4     211       60 25.8  1.00000000  0.568493151     0.4745763
## 11    5.3      46       83 20.2  0.27108434  0.003424658     0.8644068
## 12    2.6     120       54 14.2  0.10843373  0.256849315     0.3728814
## 13   10.4     249       83 24.0  0.57831325  0.698630137     0.8644068
## 14    7.2     113       65 21.0  0.38554217  0.232876712     0.5593220
## 15    2.2      56       57 11.3  0.08433735  0.037671233     0.4237288
## 16    6.0     115       66 18.0  0.31325301  0.239726027     0.5762712
## 17    9.7     109       52 16.3  0.53614458  0.219178082     0.3389831
## 18   15.4     249       66 22.2  0.87951807  0.698630137     0.5762712
## 19    2.1      83       51  7.8  0.07831325  0.130136986     0.3220339
## 20   11.3     300       67 27.8  0.63253012  0.873287671     0.5932203
## 21    4.4     149       85 16.3  0.21686747  0.356164384     0.8983051
## 22   12.1     255       74 35.1  0.68072289  0.719178082     0.7118644
## 23    2.7      72       66 14.9  0.11445783  0.092465753     0.5762712
## 24   16.1     259       44 17.1  0.92168675  0.732876712     0.2033898
## 25    9.0     178       70 28.2  0.49397590  0.455479452     0.6440678
## 26    6.0     109       53 16.4  0.31325301  0.219178082     0.3559322
## 27    4.3     102       62 16.5  0.21084337  0.195205479     0.5084746
## 28   12.2     252       81 46.0  0.68674699  0.708904110     0.8305085
## 29    2.1      57       56  9.5  0.07831325  0.041095890     0.4067797
## 30    7.4     159       89 18.8  0.39759036  0.390410959     0.9661017
## 31   11.4     285       70 32.1  0.63855422  0.821917808     0.6440678
## 32   11.1     254       86 26.1  0.62048193  0.715753425     0.9152542
## 33   13.0     337       45 16.1  0.73493976  1.000000000     0.2203390
## 34    0.8      45       44  7.3  0.00000000  0.000000000     0.2033898
## 35    7.3     120       75 21.4  0.39156627  0.256849315     0.7288136
## 36    6.6     151       68 20.0  0.34939759  0.363013699     0.6101695
## 37    4.9     159       67 29.3  0.24698795  0.390410959     0.5932203
## 38    6.3     106       72 14.9  0.33132530  0.208904110     0.6779661
## 39    3.4     174       87  8.3  0.15662651  0.441780822     0.9322034
## 40   14.4     279       48 22.5  0.81927711  0.801369863     0.2711864
## 41    3.8      86       45 12.8  0.18072289  0.140410959     0.2203390
## 42   13.2     188       59 26.9  0.74698795  0.489726027     0.4576271
## 43   12.7     201       80 25.5  0.71686747  0.534246575     0.8135593
## 44    3.2     120       80 22.9  0.14457831  0.256849315     0.8135593
## 45    2.2      48       32 11.2  0.08433735  0.010273973     0.0000000
## 46    8.5     156       63 20.7  0.46385542  0.380136986     0.5254237
## 47    4.0     145       73 26.2  0.19277108  0.342465753     0.6949153
## 48    5.7      81       39  9.3  0.29518072  0.123287671     0.1186441
## 49    2.6      53       66 10.8  0.10843373  0.027397260     0.5762712
## 50    6.8     161       60 15.6  0.36144578  0.397260274     0.4745763
##     Rape_norm
## 1  0.35917313
## 2  0.96124031
## 3  0.61240310
## 4  0.31524548
## 5  0.86046512
## 6  0.81136951
## 7  0.09819121
## 8  0.21963824
## 9  0.63565891
## 10 0.47803618
## 11 0.33333333
## 12 0.17829457
## 13 0.43152455
## 14 0.35400517
## 15 0.10335917
## 16 0.27648579
## 17 0.23255814
## 18 0.38501292
## 19 0.01291990
## 20 0.52971576
## 21 0.23255814
## 22 0.71834625
## 23 0.19638243
## 24 0.25322997
## 25 0.54005168
## 26 0.23514212
## 27 0.23772610
## 28 1.00000000
## 29 0.05684755
## 30 0.29715762
## 31 0.64082687
## 32 0.48578811
## 33 0.22739018
## 34 0.00000000
## 35 0.36434109
## 36 0.32816537
## 37 0.56847545
## 38 0.19638243
## 39 0.02583979
## 40 0.39276486
## 41 0.14211886
## 42 0.50645995
## 43 0.47028424
## 44 0.40310078
## 45 0.10077519
## 46 0.34625323
## 47 0.48837209
## 48 0.05167959
## 49 0.09043928
## 50 0.21447028

Now, if we wanted to summarize all of these columns, it would look like this.

USArrests %>% 
  summarize_all(list(mean = mean, median = median))
##   Murder_mean Assault_mean UrbanPop_mean Rape_mean Murder_median
## 1       7.788       170.76         65.54    21.232          7.25
##   Assault_median UrbanPop_median Rape_median
## 1            159              66        20.1

Again, it is that easy!

Now, since the implementation is fairly similar to the other scoped variants, I won’t belabor the point. Instead, I will leave you with one other tool from the _all family that I find helpful. Ordinarily, I prefer to work with my variable names in snake case. Luckily, the janitor package provides a great function, clean_names, to convert column names to all kinds of formats, snake case included. However, when I want to share data with people not working in R, such as my manager, they don’t want to see column names in snake case. Unfortunately, as far as I know, there is not a simple function at this point to convert R-friendly column names back to title case. To accomplish this easily, I use the rename_all function. Let me show you an example with the iris data set after I have converted it to snake case with janitor::clean_names.

iris_tbl <- iris %>% 
  as_tibble() %>% 
  janitor::clean_names()

iris_tbl
## # A tibble: 150 x 5
##    sepal_length sepal_width petal_length petal_width species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
##  1          5.1         3.5          1.4         0.2 setosa 
##  2          4.9         3            1.4         0.2 setosa 
##  3          4.7         3.2          1.3         0.2 setosa 
##  4          4.6         3.1          1.5         0.2 setosa 
##  5          5           3.6          1.4         0.2 setosa 
##  6          5.4         3.9          1.7         0.4 setosa 
##  7          4.6         3.4          1.4         0.3 setosa 
##  8          5           3.4          1.5         0.2 setosa 
##  9          4.4         2.9          1.4         0.2 setosa 
## 10          4.9         3.1          1.5         0.1 setosa 
## # ... with 140 more rows

As you can see, now the names are super R-friendly. However, we want them to be converted back to title case without the underscores. Let’s see how that can be done.

iris_tbl %>% 
  rename_all(list(~stringr::str_to_title(stringr::str_replace(., "_", " ")))) %>% 
  head() %>% 
  knitr::kable()
Sepal Length  Sepal Width  Petal Length  Petal Width  Species
         5.1          3.5           1.4          0.2  setosa
         4.9          3.0           1.4          0.2  setosa
         4.7          3.2           1.3          0.2  setosa
         4.6          3.1           1.5          0.2  setosa
         5.0          3.6           1.4          0.2  setosa
         5.4          3.9           1.7          0.4  setosa

This, in my opinion, looks much nicer in a table that you are distributing outside of the R world.
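
Although this example uses rename_all, rename has _at and _if variants as well. As a hedged sketch, you could title-case only the sepal columns like this:

iris_tbl %>% 
  rename_at(
    vars(starts_with("sepal")), 
    ~stringr::str_to_title(stringr::str_replace(., "_", " "))
  )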

Conclusion

The scoped verbs in the dplyr package are just one more example of why this package is so useful. They allow you to apply functions across numerous columns of your data set without repeating yourself, and if you start using them in your code they can greatly speed up your workflow and cut down on the amount of typing required.

This post by no means covers all of the scoped verbs that are available, but rather gives you a taste of how to use the different variants. If you have any questions or comments, please share them below!

 