Using dplyr::across() with Anonymous Functions

Fahim Ahmad | 2021-07-20

In my previous post, I briefly described the advantages of using across() to apply the same function over multiple columns. In this post, I provide an overview of using across() with anonymous functions.

I simulate the data frame below to use as an example.

set.seed(1000)
n <- 10
df <- data.frame(
  age = c(sample(18:30, n, replace = TRUE), 80),
  income = c(sample(2000:3000, n, replace = TRUE), 8000),
  education_level = c(sample(12:20, n, replace = TRUE), 25),
  gender = c(sample(c("Male", "Female"), n, replace = TRUE), "Male")
)
df

##    age income education_level gender
## 1   21   2357              13   Male
## 2   28   2300              16   Male
## 3   23   2421              14   Male
## 4   20   2928              20 Female
## 5   25   2680              18   Male
## 6   30   2540              19   Male
## 7   20   2665              12 Female
## 8   30   2441              17 Female
## 9   19   2379              19   Male
## 10  23   2438              19 Female
## 11  80   8000              25   Male

For instance, let’s say you decide to replace the outliers in numeric columns with the mean value before carrying out some statistical analysis.

Note: The presence of outliers can be checked using the boxplot.stats() function from the grDevices package. Run ?boxplot.stats for more information.

First, let’s replace the outliers with the mean value one column at a time.
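As a minimal sketch of what boxplot.stats() reports, the snippet below runs it on the age values copied from the printed data frame above. The variable names `stats` and `is_outlier` are only illustrative.

```r
# Age values copied from the simulated data frame above
age <- c(21, 28, 23, 20, 25, 30, 20, 30, 19, 23, 80)

# boxplot.stats() returns a list; its $out element holds the points that lie
# more than coef (default 1.5) times the hinge spread beyond the hinges
stats <- boxplot.stats(age)
stats$out
## [1] 80

# Flag each element that appears among the reported outliers
is_outlier <- age %in% stats$out
which(is_outlier)
## [1] 11
```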

library(dplyr)

df %>%
  mutate(
    age = case_when(
      age %in% boxplot.stats(age)$out ~ mean(age, na.rm = TRUE),
      TRUE ~ age
    ),
    income = case_when(
      income %in% boxplot.stats(income)$out ~ mean(income, na.rm = TRUE),
      TRUE ~ income
    ),
    education_level = case_when(
      education_level %in% boxplot.stats(education_level)$out ~ mean(education_level, na.rm = TRUE),
      TRUE ~ education_level
    )
  )

##    age   income education_level gender
## 1   21 2357.000              13   Male
## 2   28 2300.000              16   Male
## 3   23 2421.000              14   Male
## 4   20 2928.000              20 Female
## 5   25 2680.000              18   Male
## 6   30 2540.000              19   Male
## 7   20 2665.000              12 Female
## 8   30 2441.000              17 Female
## 9   19 2379.000              19   Male
## 10  23 2438.000              19 Female
## 11  29 3013.545              25   Male
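The value 29 in row 11 is the mean of all eleven ages, with the outlier itself included. The short check below reproduces that number from the age values in the table above. The second calculation, which drops the outlier before averaging, is only an illustration of the difference and is not something this post uses.

```r
# Age values copied from the data frame above
age <- c(21, 28, 23, 20, 25, 30, 20, 30, 19, 23, 80)

# Mean used as the replacement value: computed over every row, outlier included
mean_all <- mean(age, na.rm = TRUE)
mean_all
## [1] 29

# Illustrative alternative: mean of the values that are not flagged as outliers
out <- boxplot.stats(age)$out
mean_without_out <- mean(age[!age %in% out], na.rm = TRUE)
mean_without_out
## [1] 23.9
```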

Now imagine the data set had 100 additional numeric columns whose outliers you wanted to handle. With the above approach, you would have to write a very long chunk of code inside the mutate() function.

Alternatively, you can create a custom function to deal with the outliers and apply it to all numeric columns using the across() function as below.

# creating a function to replace outliers with the mean value
outliers_to_mean <- function(x) {
  # return the result directly; ending on an assignment would return it invisibly
  case_when(
    x %in% boxplot.stats(x)$out ~ mean(x, na.rm = TRUE),
    TRUE ~ x
  )
}

# using the custom function inside the across() function
df %>% 
  mutate(
    across(
      .cols = where(is.numeric),
      .fns = outliers_to_mean
    )
  )

##    age   income education_level gender
## 1   21 2357.000              13   Male
## 2   28 2300.000              16   Male
## 3   23 2421.000              14   Male
## 4   20 2928.000              20 Female
## 5   25 2680.000              18   Male
## 6   30 2540.000              19   Male
## 7   20 2665.000              12 Female
## 8   30 2441.000              17 Female
## 9   19 2379.000              19   Male
## 10  23 2438.000              19 Female
## 11  29 3013.545              25   Male
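Because outliers_to_mean() is an ordinary function, it can also be tried on a single vector before it is passed to across(). The sketch below restates the function so that it runs on its own, and uses the age values copied from the data frame above. The name `cleaned` is only illustrative.

```r
library(dplyr)

# Restated from the post so that this snippet runs on its own
outliers_to_mean <- function(x) {
  case_when(
    x %in% boxplot.stats(x)$out ~ mean(x, na.rm = TRUE),
    TRUE ~ x
  )
}

# Age values copied from the data frame above
age <- c(21, 28, 23, 20, 25, 30, 20, 30, 19, 23, 80)

cleaned <- outliers_to_mean(age)

# The outlier in position 11 is replaced by the overall mean
cleaned[11]
## [1] 29

# All other positions are left unchanged
identical(cleaned[-11], age[-11])
## [1] TRUE
```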

But what if you need the outliers_to_mean() function only in this specific instance, and you know you will not use it again?

In that case, an anonymous function is a better fit than a named one. For instance, here is the code above rewritten with an anonymous function; it produces the same result.

df %>% 
  mutate(
    across(
      .cols = where(is.numeric),
      .fns = function(x) {
        case_when(
          x %in% boxplot.stats(x)$out ~ mean(x, na.rm = TRUE),
          TRUE ~ x
        )
      }
    )
  )

##    age   income education_level gender
## 1   21 2357.000              13   Male
## 2   28 2300.000              16   Male
## 3   23 2421.000              14   Male
## 4   20 2928.000              20 Female
## 5   25 2680.000              18   Male
## 6   30 2540.000              19   Male
## 7   20 2665.000              12 Female
## 8   30 2441.000              17 Female
## 9   19 2379.000              19   Male
## 10  23 2438.000              19 Female
## 11  29 3013.545              25   Male

There is a way to trim the syntax further and write fewer lines of code. The .fns argument also accepts a purrr-style lambda: in the above chunk of code, you can replace function(x) {} with ~ {}, which is just another way of saying that you have started an anonymous function.

Note: While using ~, the .x symbol can be used to refer to the argument of the anonymous function.

df %>% 
  mutate(
    across(
      .cols = where(is.numeric),
      .fns = ~ {
        case_when(
          .x %in% boxplot.stats(.x)$out ~ mean(.x, na.rm = TRUE),
          TRUE ~ .x
        )
      }
    )
  )

##    age   income education_level gender
## 1   21 2357.000              13   Male
## 2   28 2300.000              16   Male
## 3   23 2421.000              14   Male
## 4   20 2928.000              20 Female
## 5   25 2680.000              18   Male
## 6   30 2540.000              19   Male
## 7   20 2665.000              12 Female
## 8   30 2441.000              17 Female
## 9   19 2379.000              19   Male
## 10  23 2438.000              19 Female
## 11  29 3013.545              25   Male
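R 4.1.0 and later also provide a native shorthand for anonymous functions, \(x), which can be passed to .fns in the same way. The sketch below assumes R >= 4.1.0 and rebuilds the data frame from the values printed above so that it runs on its own. The name `result` is only illustrative.

```r
library(dplyr)

# Data copied from the printed data frame above
df <- data.frame(
  age = c(21, 28, 23, 20, 25, 30, 20, 30, 19, 23, 80),
  income = c(2357, 2300, 2421, 2928, 2680, 2540, 2665, 2441, 2379, 2438, 8000),
  education_level = c(13, 16, 14, 20, 18, 19, 12, 17, 19, 19, 25),
  gender = c("Male", "Male", "Male", "Female", "Male", "Male",
             "Female", "Female", "Male", "Female", "Male")
)

# \(x) is the base R shorthand for function(x); requires R >= 4.1.0
result <- df %>%
  mutate(
    across(
      .cols = where(is.numeric),
      .fns = \(x) case_when(
        x %in% boxplot.stats(x)$out ~ mean(x, na.rm = TRUE),
        TRUE ~ x
      )
    )
  )

# Row 11 holds the replaced values for age and income
result[11, ]
```

Unlike the ~ form, the native shorthand names its argument explicitly, so there is no .x to keep track of.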