Sometimes you may encounter duplicated values in your data, which can cause problems depending on how you plan to use it. In this post, I provide an overview of the duplicated() function from base R and the distinct() function from the dplyr package for detecting and removing duplicates.
I will be using the following data frame as an example in this post.
set.seed(1000)
df <- data.frame(
  ID = sample(paste0("ID-00", 1:7), 10, replace = TRUE),
  value_1 = sample(1:7, 10, replace = TRUE),
  value_2 = sample(2:5, 10, replace = TRUE)
)
df
## ID value_1 value_2
## 1 ID-004 5 2
## 2 ID-003 6 5
## 3 ID-006 1 3
## 4 ID-003 1 4
## 5 ID-005 5 5
## 6 ID-003 2 3
## 7 ID-005 2 2
## 8 ID-002 4 3
## 9 ID-006 7 2
## 10 ID-006 2 3
The duplicated() function returns a logical vector in which TRUE marks the rows of the data frame that are duplicates. For instance, duplicated(df["ID"]) returns the following vector.
## [1] FALSE FALSE FALSE TRUE FALSE TRUE TRUE FALSE TRUE TRUE
Note: the duplicated() function leaves the first occurrence of each value unmarked when identifying duplicates. If you want to consider duplication from the reverse side instead, set the fromLast argument to TRUE.
Let’s use the above vector to exclude the duplicated values.
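Negating the logical vector and using it to subset df keeps only the first occurrence of each ID:

```r
# Keep only the rows whose ID has not appeared earlier in the data frame
df[!duplicated(df["ID"]), ]
```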
## ID value_1 value_2
## 1 ID-004 5 2
## 2 ID-003 6 5
## 3 ID-006 1 3
## 5 ID-005 5 5
## 8 ID-002 4 3
An alternative way to select unique values is the dplyr::distinct() function, which yields a similar result.
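Assuming the dplyr package is installed, the equivalent call is:

```r
# Keep the first row for each ID; .keep_all = TRUE retains the other columns
dplyr::distinct(df, ID, .keep_all = TRUE)
```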
## ID value_1 value_2
## 1 ID-004 5 2
## 2 ID-003 6 5
## 3 ID-006 1 3
## 4 ID-005 5 5
## 5 ID-002 4 3
Note: the .keep_all argument is used to retain all other columns in the output data frame.
The above chunks of code remove duplicated rows based on a single column. What if we want to remove duplicates based on more than one column?
One way is to concatenate the columns in which you want to check for duplicates. For example, let's remove the rows where the combination of value_1 and value_2 is duplicated.
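Pasting the two columns into a single helper column (named value_1_2 here, matching the output below) and running duplicated() on it looks like this:

```r
# Combine the two columns into one key, then drop rows with a duplicated key
df$value_1_2 <- paste(df$value_1, df$value_2)
df[!duplicated(df["value_1_2"]), ]
```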
## ID value_1 value_2 value_1_2
## 1 ID-004 5 2 5 2
## 2 ID-003 6 5 6 5
## 3 ID-006 1 3 1 3
## 4 ID-003 1 4 1 4
## 5 ID-005 5 5 5 5
## 6 ID-003 2 3 2 3
## 7 ID-005 2 2 2 2
## 8 ID-002 4 3 4 3
## 9 ID-006 7 2 7 2
A more efficient way to detect duplicates across a combination of columns, without concatenating their values first, is to pass the columns directly:
# using duplicated() function
df[!duplicated(df[c("value_1", "value_2")]), ]
# using distinct() function
dplyr::distinct(df, value_1, value_2, .keep_all = TRUE)
## ID value_1 value_2 value_1_2
## 1 ID-004 5 2 5 2
## 2 ID-003 6 5 6 5
## 3 ID-006 1 3 1 3
## 4 ID-003 1 4 1 4
## 5 ID-005 5 5 5 5
## 6 ID-003 2 3 2 3
## 7 ID-005 2 2 2 2
## 8 ID-002 4 3 4 3
## 9 ID-006 7 2 7 2
Both approaches work well. However, the advantage of duplicated() from base R is that it returns a logical vector identifying the duplicated rows, which can be used either to drop those rows or to keep only them for further investigation. The distinct() function, on the other hand, removes the duplicated rows directly without indicating which rows had duplicate values.
For instance, let’s keep the duplicated ID numbers only.
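Indexing with the logical vector itself (rather than its negation) keeps only the rows flagged as duplicates:

```r
# Keep only the rows whose ID has already appeared earlier in the data frame
df[duplicated(df["ID"]), ]
```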
## ID value_1 value_2 value_1_2
## 4 ID-003 1 4 1 4
## 6 ID-003 2 3 2 3
## 7 ID-005 2 2 2 2
## 9 ID-006 7 2 7 2
## 10 ID-006 2 3 2 3
As mentioned above, the duplicated() function does not mark the first occurrence of a value as a duplicate. Thus, to capture every row of a duplicated ID, we also need to count from the reverse side by setting the fromLast argument to TRUE and combining the two passes.
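Combining the forward and backward passes with | flags every row whose ID occurs more than once:

```r
# A row is flagged if its ID is duplicated counting from the top OR from the bottom
df[duplicated(df["ID"]) | duplicated(df["ID"], fromLast = TRUE), ]
```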
## ID value_1 value_2 value_1_2
## 2 ID-003 6 5 6 5
## 3 ID-006 1 3 1 3
## 4 ID-003 1 4 1 4
## 5 ID-005 5 5 5 5
## 6 ID-003 2 3 2 3
## 7 ID-005 2 2 2 2
## 9 ID-006 7 2 7 2
## 10 ID-006 2 3 2 3