Advantages of dplyr::across() to Apply a Function Across Multiple Columns

Fahim Ahmad | 2021-05-15

At some point in your life you have probably found yourself repeating the same function across multiple columns. Although there isn’t anything inherently wrong with doing this, sometimes you might want to simplify and shorten your code to have a neat script.

In R, the across() function from the dplyr package (available since dplyr 1.0.0) lets you apply a data transformation or a summary statistic across all or a subset of the columns of a data frame.

I am using the iris data as an example in this post. Let’s have a glance at it: it has five columns, four numeric and one factor.

str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Let’s say we want to find out the average of all numeric columns.

One way of doing this is to call mean() on each column separately, or to combine summarise() and mean() to get the result as a data frame.

library(dplyr)

# mean(iris$Sepal.Length)
# mean(iris$Sepal.Width)
# mean(iris$Petal.Length)
# mean(iris$Petal.Width)

iris %>% 
  summarise(
    Sepal.Length_Mean = mean(Sepal.Length),
    Sepal.Width_Mean = mean(Sepal.Width),
    Petal.Length_Mean = mean(Petal.Length),
    Petal.Width_Mean = mean(Petal.Width)
  )
##   Sepal.Length_Mean Sepal.Width_Mean Petal.Length_Mean Petal.Width_Mean
## 1          5.843333         3.057333             3.758         1.199333

With the help of across(), you can achieve the same result with the following chunk of code:

iris %>% 
  summarise(
    across(
      .cols = c(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width),
      .fns = mean
    )
  )
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     5.843333    3.057333        3.758    1.199333
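
The same pattern carries over to mutate() when the goal is to transform columns rather than summarise them. The following is only a minimal sketch: it assumes we want the four measurements in millimetres instead of centimetres, and iris_mm is a name made up for this illustration.

```r
library(dplyr)

# iris_mm is a hypothetical name; the iris measurements are in centimetres,
# so multiplying by 10 converts them to millimetres.
iris_mm <- iris %>%
  mutate(
    across(
      .cols = c(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width),
      .fns = function(x) x * 10
    )
  )

head(iris_mm, 3)
```

Because no .names is given, the transformed values overwrite the original columns, and Species is left untouched.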

Advantages of using the across() function

My first impression of the across() function was, “So what? What is the advantage of this approach if I still have to write out the name of every column I want to apply the function to?” If you are new to R, you may well have thought the same the first time you used it.

Below are some reasons why I think across() is worth using when you transform data or compute the same summary statistics for multiple columns, instead of repeating the same function call many times.

1) .names argument

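You can use the .names argument to control the names of the columns produced by the .fns operation. It takes a glue-style template with two special placeholders, “{col}” and “{fn}”. The dplyr documentation spells them “{.col}” and “{.fn}”; the undotted spellings used in this post are accepted as well.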

When a single function is passed to .fns, the default is equivalent to “{col}”: the resulting columns keep the names of the columns that were passed in the “.cols” argument.

iris %>% 
  summarise(
    across(
      .cols = c(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width),
      .fns = mean,
      .names = "{col}"
    )
  )
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     5.843333    3.057333        3.758    1.199333

Let’s modify the template so that the resulting column names carry a “_mean” suffix.

iris %>% 
  summarise(
    across(
      .cols = c(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width),
      .fns = mean,
      .names = "{col}_mean"
    )
  )
##   Sepal.Length_mean Sepal.Width_mean Petal.Length_mean Petal.Width_mean
## 1          5.843333         3.057333             3.758         1.199333

The special “{fn}” placeholder comes in handy when you apply several functions at once: each resulting column is named after both its source column and the function that produced it, whichever functions you pass.

# a named list of functions should be passed to the .fns argument

iris %>% 
  summarise(
    across(
      .cols = c(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width),
      .fns = list(mean = mean, median = median),
      .names = "{col}_{fn}"
    )
  )
##   Sepal.Length_mean Sepal.Length_median Sepal.Width_mean Sepal.Width_median Petal.Length_mean
## 1          5.843333                 5.8         3.057333                  3             3.758
##   Petal.Length_median Petal.Width_mean Petal.Width_median
## 1                4.35         1.199333                1.3
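
As a small sketch of two related points: when .names is left out and .fns is a named list, dplyr falls back to the "{.col}_{.fn}" pattern, and the template can place the function name first. The object names default_names and fn_first are made up for this illustration, and the dotted placeholders are the spelling used in the dplyr documentation.

```r
library(dplyr)

# .names omitted: with a named list in .fns, the default is "{.col}_{.fn}"
default_names <- iris %>%
  summarise(
    across(
      .cols = c(Sepal.Length, Sepal.Width),
      .fns = list(mean = mean, median = median)
    )
  )

names(default_names)

# Function name first, using the documented dotted placeholders
fn_first <- iris %>%
  summarise(
    across(
      .cols = c(Sepal.Length, Sepal.Width),
      .fns = list(mean = mean, median = median),
      .names = "{.fn}_{.col}"
    )
  )

names(fn_first)
```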

2) select helpers

You can combine across() with the select helpers to choose a set of columns in different ways, rather than listing the column names in a vector, c(column1, column2, column3, ...).

  • starts_with() applies the function to the columns whose names start with a given prefix.
iris %>% 
  summarise(
    across(
      .cols = starts_with("Sepal"),
      .fns = list(mean = mean),
      .names = "{col}_{fn}"
    )
  )
##   Sepal.Length_mean Sepal.Width_mean
## 1          5.843333         3.057333
  • ends_with() applies the function to the columns whose names end with a given suffix.
iris %>% 
  summarise(
    across(
      .cols = ends_with("Width"),
      .fns = list(mean = mean),
      .names = "{col}_{fn}"
    )
  )
##   Sepal.Width_mean Petal.Width_mean
## 1         3.057333         1.199333
  • contains() applies the function to the columns whose names contain a given literal string.
iris %>% 
  summarise(
    across(
      .cols = contains("."),
      .fns = list(mean = mean),
      .names = "{col}_{fn}"
    )
  )
##   Sepal.Length_mean Sepal.Width_mean Petal.Length_mean Petal.Width_mean
## 1          5.843333         3.057333             3.758         1.199333
  • where() applies the function to all columns that satisfy a logical condition.
iris %>% 
  summarise(
    across(
      .cols = where(is.numeric),
      .fns = list(mean = mean),
      .names = "{col}_{fn}"
    )
  )
##   Sepal.Length_mean Sepal.Width_mean Petal.Length_mean Petal.Width_mean
## 1          5.843333         3.057333             3.758         1.199333
  • You can also select columns by their position.
iris %>% 
  summarise(
    across(
      .cols = 1:4,
      .fns = list(mean = mean),
      .names = "{col}_{fn}"
    )
  )
##   Sepal.Length_mean Sepal.Width_mean Petal.Length_mean Petal.Width_mean
## 1          5.843333         3.057333             3.758         1.199333
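
The helpers above can also be combined with the tidyselect operators ! (negation) and & (intersection). The sketch below is only an illustration of that idea; all_but_species and petal_means are names made up here.

```r
library(dplyr)

# Every column except Species
all_but_species <- iris %>%
  summarise(
    across(
      .cols = !Species,
      .fns = mean
    )
  )

all_but_species

# Numeric columns whose names start with "Petal"
petal_means <- iris %>%
  summarise(
    across(
      .cols = where(is.numeric) & starts_with("Petal"),
      .fns = list(mean = mean),
      .names = "{col}_{fn}"
    )
  )

petal_means
```

Negating the one column you want to skip is often shorter than listing all the columns you want to keep.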

3) multiple functions at once

Another significant advantage of the across() function is that you can apply multiple functions to multiple columns at once. For instance, let’s say you want to find out the minimum, mean, maximum, and standard deviation of all numeric columns in the data.

iris %>% 
  summarise(
    across(
      .cols = 1:4,
      .fns = list(
        Min = min,
        Mean = mean,
        Max = max,
        sd = sd
      ),
      .names = "{col}_{fn}"
    )
  )
##   Sepal.Length_Min Sepal.Length_Mean Sepal.Length_Max Sepal.Length_sd Sepal.Width_Min
## 1              4.3          5.843333              7.9       0.8280661               2
##   Sepal.Width_Mean Sepal.Width_Max Sepal.Width_sd Petal.Length_Min Petal.Length_Mean
## 1         3.057333             4.4      0.4358663                1             3.758
##   Petal.Length_Max Petal.Length_sd Petal.Width_Min Petal.Width_Mean Petal.Width_Max Petal.Width_sd
## 1              6.9        1.765298             0.1         1.199333             2.5      0.7622377
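
The entries of the .fns list do not have to be bare function names. As a minimal sketch, assuming you need to pass extra arguments such as na.rm = TRUE, each entry can be an anonymous function. The iris data has no missing values, so the example uses a small toy data frame invented for this illustration (toy, a, b, and n_missing are all made-up names).

```r
library(dplyr)

# Hypothetical toy data with one missing value in each column
toy <- data.frame(
  a = c(1, 2, NA),
  b = c(10, NA, 30)
)

toy_summary <- toy %>%
  summarise(
    across(
      .cols = c(a, b),
      .fns = list(
        # mean of the non-missing values
        mean = function(x) mean(x, na.rm = TRUE),
        # number of missing values
        n_missing = function(x) sum(is.na(x))
      ),
      .names = "{col}_{fn}"
    )
  )

toy_summary
```

Wrapping the call in an anonymous function keeps the extra arguments next to the function they belong to, which stays readable when different functions in the list need different arguments.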