At some point in your life you have probably found yourself repeating the same function across multiple columns. Although there isn’t anything inherently wrong with doing this, sometimes you might want to simplify and shorten your code to have a neat script.
In R, the across() function from the dplyr package lets you apply a data transformation or a summary statistic to all columns of a data frame, or to a subset of them.
I am using the iris data as an example in this post. Let’s have a glance at it: it has five columns, four numeric and one factor.
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
Let’s say we want to find the average of every numeric column.
One way of doing this is to call mean() on each column separately, or to combine the summarise() and mean() functions to get the result as a data frame.
library(dplyr)
# mean(iris$Sepal.Length)
# mean(iris$Sepal.Width)
# mean(iris$Petal.Length)
# mean(iris$Petal.Width)
iris %>%
  summarise(
    Sepal.Length_Mean = mean(Sepal.Length),
    Sepal.Width_Mean = mean(Sepal.Width),
    Petal.Length_Mean = mean(Petal.Length),
    Petal.Width_Mean = mean(Petal.Width)
  )
## Sepal.Length_Mean Sepal.Width_Mean Petal.Length_Mean Petal.Width_Mean
## 1 5.843333 3.057333 3.758 1.199333
With the help of across() you can achieve the same result using the chunk of code below:
iris %>%
  summarise(
    across(
      .cols = c(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width),
      .fns = mean
    )
  )
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 5.843333 3.057333 3.758 1.199333
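As a quick sanity check, the two approaches can be compared directly. This is only a sketch, assuming a dplyr version that provides across() (1.0.0 or later); the object names `manual` and `with_across` are made up for illustration.

```r
library(dplyr)

# Means computed one column at a time, keeping the original column names
manual <- iris %>%
  summarise(
    Sepal.Length = mean(Sepal.Length),
    Sepal.Width  = mean(Sepal.Width),
    Petal.Length = mean(Petal.Length),
    Petal.Width  = mean(Petal.Width)
  )

# The same means computed with a single across() call
with_across <- iris %>%
  summarise(
    across(
      .cols = c(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width),
      .fns = mean
    )
  )

# Compare column names and values of the two one-row data frames
all.equal(manual, with_across)
```

If the comparison reports a difference, the two summaries are not interchangeable and the column selection is the first thing to check.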
My first impression of the across() function was “So what?” What is the advantage of this approach if I still have to write out every column name I want to apply the function to? If you are new to R, you may well have thought the same the first time you used it.
Below are some reasons why I think we should use the across() function when transforming data or computing the same summary statistics for multiple columns, instead of repeating the same function call multiple times.
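You can use the .names argument to control the names of the output columns produced by the .fns operation. The .names argument accepts two special keywords, “{col}” and “{fn}”, which stand for the selected column name and the name of the function being applied (the current dplyr documentation spells them “{.col}” and “{.fn}”).
When a single function is passed, the default is equivalent to “{col}”: the resulting columns keep the names of the columns that were passed in the “.cols” argument.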
iris %>%
  summarise(
    across(
      .cols = c(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width),
      .fns = mean,
      .names = "{col}"
    )
  )
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 5.843333 3.057333 3.758 1.199333
Let’s modify the names so that each resulting column has “_mean” as a suffix.
iris %>%
  summarise(
    across(
      .cols = c(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width),
      .fns = mean,
      .names = "{col}_mean"
    )
  )
## Sepal.Length_mean Sepal.Width_mean Petal.Length_mean Petal.Width_mean
## 1 5.843333 3.057333 3.758 1.199333
The special “{fn}” keyword comes in handy when you apply several functions and want each resulting column name to combine the original column name with the name of the function that produced it, whichever functions you use.
# a named list of functions should be passed to the .fns argument
iris %>%
  summarise(
    across(
      .cols = c(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width),
      .fns = list(mean = mean, median = median),
      .names = "{col}_{fn}"
    )
  )
## Sepal.Length_mean Sepal.Length_median Sepal.Width_mean Sepal.Width_median Petal.Length_mean
## 1 5.843333 5.8 3.057333 3 3.758
## Petal.Length_median Petal.Width_mean Petal.Width_median
## 1 4.35 1.199333 1.3
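The keywords can appear anywhere in the .names template, not only as a suffix. The sketch below puts the function name first; it uses the “{.col}” and “{.fn}” spelling from the current dplyr documentation, which plays the same role as the “{col}” and “{fn}” used above. The template text and the object name `renamed` are made up for illustration.

```r
library(dplyr)

# Build names of the form "<function>_of_<column>"
renamed <- iris %>%
  summarise(
    across(
      .cols = c(Sepal.Length, Sepal.Width),
      .fns = list(mean = mean, median = median),
      .names = "{.fn}_of_{.col}"
    )
  )

# Inspect the generated column names
names(renamed)
```

The function part of each name comes from the names of the list passed to .fns, so renaming a list element is enough to rename the matching output columns.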
You can take advantage of select helpers with the across() function to pick a set of columns in different ways, rather than passing the column names in vector notation, c(column1, column2, column3, ...):
starts_with() applies the function to the columns whose names start with a given prefix.
iris %>%
  summarise(
    across(
      .cols = starts_with("Sepal"),
      .fns = list(mean = mean),
      .names = "{col}_{fn}"
    )
  )
## Sepal.Length_mean Sepal.Width_mean
## 1 5.843333 3.057333
ends_with() applies the function to the columns whose names end with a given suffix.
iris %>%
  summarise(
    across(
      .cols = ends_with("Width"),
      .fns = list(mean = mean),
      .names = "{col}_{fn}"
    )
  )
## Sepal.Width_mean Petal.Width_mean
## 1 3.057333 1.199333
contains() applies the function to the columns whose names contain a given literal string.
iris %>%
  summarise(
    across(
      .cols = contains("."),
      .fns = list(mean = mean),
      .names = "{col}_{fn}"
    )
  )
## Sepal.Length_mean Sepal.Width_mean Petal.Length_mean Petal.Width_mean
## 1 5.843333 3.057333 3.758 1.199333
where() applies the function to all columns for which a predicate function, such as is.numeric, returns TRUE.
iris %>%
  summarise(
    across(
      .cols = where(is.numeric),
      .fns = list(mean = mean),
      .names = "{col}_{fn}"
    )
  )
## Sepal.Length_mean Sepal.Width_mean Petal.Length_mean Petal.Width_mean
## 1 5.843333 3.057333 3.758 1.199333
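Select helpers can also be combined with the boolean operators that tidyselect supports, or negated to drop columns. The following is a small sketch of both ideas on iris; the object names `petal_means` and `all_but_species` are made up for illustration.

```r
library(dplyr)

# Keep only columns that are numeric AND whose names start with "Petal"
petal_means <- iris %>%
  summarise(
    across(
      .cols = where(is.numeric) & starts_with("Petal"),
      .fns = mean
    )
  )

# Select everything except one column instead of listing the ones to keep
all_but_species <- iris %>%
  summarise(
    across(
      .cols = !Species,
      .fns = mean
    )
  )

names(petal_means)
names(all_but_species)
```

Excluding a column by name is convenient when the data frame is wide and only a few columns should be left out.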
Another significant advantage of the across() function is that you can apply multiple functions to multiple columns at once. For instance, let’s say you want to find the minimum, mean, maximum, and standard deviation of all numeric columns in the data. The first four columns of iris are the numeric ones, so they can be selected by position with 1:4.
iris %>%
  summarise(
    across(
      .cols = 1:4,
      .fns = list(
        Min = min,
        Mean = mean,
        Max = max,
        sd = sd
      ),
      .names = "{col}_{fn}"
    )
  )
## Sepal.Length_Min Sepal.Length_Mean Sepal.Length_Max Sepal.Length_sd Sepal.Width_Min
## 1 4.3 5.843333 7.9 0.8280661 2
## Sepal.Width_Mean Sepal.Width_Max Sepal.Width_sd Petal.Length_Min Petal.Length_Mean
## 1 3.057333 4.4 0.4358663 1 3.758
## Petal.Length_Max Petal.Length_sd Petal.Width_Min Petal.Width_Mean Petal.Width_Max Petal.Width_sd
## 1 6.9 1.765298 0.1 1.199333 2.5 0.7622377
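The same pattern extends naturally to grouped data. As a hedged sketch, the code below computes two of these statistics per species by adding group_by() before summarise(); it leaves .names unset and so relies on the default naming for a list of functions. The object name `by_species` is made up for illustration.

```r
library(dplyr)

# One row per species, with a mean and a standard deviation for each numeric column
by_species <- iris %>%
  group_by(Species) %>%
  summarise(
    across(
      .cols = where(is.numeric),
      .fns = list(Mean = mean, sd = sd)
    )
  )

# One grouping column plus two statistics for each of the four measurements
dim(by_species)
names(by_species)
```

Grouping columns are excluded from across() automatically, so Species never has to be removed from the selection by hand.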