Sometimes you may encounter duplicated values in your data, which can cause problems depending on how you plan to use it. In this post, I provide an overview of the duplicated() function from base R and the distinct() function from the dplyr package for detecting and removing duplicates.
I will be using the following data frame as an example in this post.
set.seed(1000)
df <- data.frame(
  ID = sample(paste0("ID-00", 1:7), 10, replace = TRUE),
  value_1 = sample(1:7, 10, replace = TRUE),
  value_2 = sample(2:5, 10, replace = TRUE)
)
df
## ID value_1 value_2
## 1 ID-004 5 2
## 2 ID-003 6 5
## 3 ID-006 1 3
## 4 ID-003 1 4
## 5 ID-005 5 5
## 6 ID-003 2 3
## 7 ID-005 2 2
## 8 ID-002 4 3
## 9 ID-006 7 2
## 10 ID-006 2 3
The duplicated() function returns a logical vector in which TRUE marks the rows of the data frame that are duplicates. For instance, duplicated(df["ID"]) returns the following vector.
## [1] FALSE FALSE FALSE TRUE FALSE TRUE TRUE FALSE TRUE TRUE
Note: the duplicated() function leaves the first occurrence of each value unmarked when identifying duplicates. If you want to consider duplication from the reverse side instead, set the fromLast argument to TRUE.
Let’s use the above vector to exclude the duplicated values.
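Negating the logical vector and using it to subset df keeps only the first occurrence of each ID:

```r
# Keep only the rows whose ID has not appeared earlier in the data frame
df[!duplicated(df["ID"]), ]
```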
## ID value_1 value_2
## 1 ID-004 5 2
## 2 ID-003 6 5
## 3 ID-006 1 3
## 5 ID-005 5 5
## 8 ID-002 4 3
An alternative way to select unique values is the dplyr::distinct() function, which yields a similar result.
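Assuming the dplyr package is installed, the equivalent call is:

```r
# Keep the first row for each ID; .keep_all = TRUE retains the other columns
dplyr::distinct(df, ID, .keep_all = TRUE)
```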
## ID value_1 value_2
## 1 ID-004 5 2
## 2 ID-003 6 5
## 3 ID-006 1 3
## 4 ID-005 5 5
## 5 ID-002 4 3
Note: the .keep_all argument is used to retain all other columns in the output data frame.
The above chunks of code remove duplicated rows based on a single column. What if we want to remove duplicates based on more than one column?
One way is to concatenate the columns in which you want to check for duplicates. For example, let's remove the rows where the combination of value_1 and value_2 is duplicated.
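Pasting the two columns into a single helper column (named value_1_2 here, matching the output below) and running duplicated() on it looks like this:

```r
# Combine the two columns into one key, then drop rows with a duplicated key
df$value_1_2 <- paste(df$value_1, df$value_2)
df[!duplicated(df["value_1_2"]), ]
```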
## ID value_1 value_2 value_1_2
## 1 ID-004 5 2 5 2
## 2 ID-003 6 5 6 5
## 3 ID-006 1 3 1 3
## 4 ID-003 1 4 1 4
## 5 ID-005 5 5 5 5
## 6 ID-003 2 3 2 3
## 7 ID-005 2 2 2 2
## 8 ID-002 4 3 4 3
## 9 ID-006 7 2 7 2
A more efficient way to detect duplicates across a combination of columns, without concatenating their values first, is to pass the columns directly:
# using duplicated() function
df[!duplicated(df[c("value_1", "value_2")]), ]
# using distinct() function
dplyr::distinct(df, value_1, value_2, .keep_all = TRUE)
## ID value_1 value_2 value_1_2
## 1 ID-004 5 2 5 2
## 2 ID-003 6 5 6 5
## 3 ID-006 1 3 1 3
## 4 ID-003 1 4 1 4
## 5 ID-005 5 5 5 5
## 6 ID-003 2 3 2 3
## 7 ID-005 2 2 2 2
## 8 ID-002 4 3 4 3
## 9 ID-006 7 2 7 2
Both approaches work well. However, the advantage of duplicated() from base R is that it returns a logical vector identifying the duplicated rows, which can be used either to drop those rows or to keep only them for further investigation. The distinct() function, on the other hand, removes the duplicated rows directly without indicating which rows had duplicate values.
For instance, let’s keep the duplicated ID numbers only.
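Indexing with the logical vector itself (rather than its negation) keeps only the rows flagged as duplicates:

```r
# Keep only the rows whose ID has already appeared earlier in the data frame
df[duplicated(df["ID"]), ]
```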
## ID value_1 value_2 value_1_2
## 4 ID-003 1 4 1 4
## 6 ID-003 2 3 2 3
## 7 ID-005 2 2 2 2
## 9 ID-006 7 2 7 2
## 10 ID-006 2 3 2 3
As mentioned above, the duplicated() function does not mark the first occurrence of a value as a duplicate. Thus, to capture every row of a duplicated ID, we also need to count from the reverse side by setting the fromLast argument to TRUE and combining the two passes.
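Combining the forward and backward passes with | flags every row whose ID occurs more than once:

```r
# A row is flagged if its ID is duplicated counting from the top OR from the bottom
df[duplicated(df["ID"]) | duplicated(df["ID"], fromLast = TRUE), ]
```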
## ID value_1 value_2 value_1_2
## 2 ID-003 6 5 6 5
## 3 ID-006 1 3 1 3
## 4 ID-003 1 4 1 4
## 5 ID-005 5 5 5 5
## 6 ID-003 2 3 2 3
## 7 ID-005 2 2 2 2
## 9 ID-006 7 2 7 2
## 10 ID-006 2 3 2 3