For people who have come here to look for a general answer for duplicate row removal, use !duplicated()
:
a <- c(rep("A", 3), rep("B", 3), rep("C",2))
b <- c(1,1,2,4,1,1,2,2)
df <-data.frame(a,b)
duplicated(df)
[1] FALSE TRUE FALSE FALSE FALSE TRUE FALSE TRUE
> df[duplicated(df), ]
a b
2 A 1
6 B 1
8 C 2
> df[!duplicated(df), ]
a b
1 A 1
3 A 2
4 B 4
5 B 1
7 C 2
Answer from: Removing duplicated rows from R data frame
just isolate your data frame to the columns you need, then use the unique function :D
# in the above example, you only need the first three columns
deduped.data <- unique( yourdata[ , 1:3 ] )
# the fourth column no longer 'distinguishes' them,
# so they're duplicates and thrown out.
The function distinct()
in the dplyr
package performs arbitrary duplicate removal, either from specific columns/variables (as in this question) or considering all columns/variables. dplyr
is part of the tidyverse
.
Data and package
library(dplyr)
dat <- data.frame(a = rep(c(1,2),4), b = rep(LETTERS[1:4],2))
Remove rows duplicated in a specific column (e.g., columna
)
Note that .keep_all = TRUE
retains all columns, otherwise only column a
would be retained.
distinct(dat, a, .keep_all = TRUE)
a b
1 1 A
2 2 B
Remove rows that are complete duplicates of other rows:
distinct(dat)
a b
1 1 A
2 2 B
3 1 C
4 2 D
The data.table
package also has unique
and duplicated
methods of it's own with some additional features.
Both the unique.data.table
and the duplicated.data.table
methods have an additional by
argument which allows you to pass a character
or integer
vector of column names or their locations respectively
library(data.table)
DT <- data.table(id = c(1,1,1,2,2,2),
val = c(10,20,30,10,20,30))
unique(DT, by = "id")
# id val
# 1: 1 10
# 2: 2 10
duplicated(DT, by = "id")
# [1] FALSE TRUE TRUE FALSE TRUE TRUE
Another important feature of these methods is a huge performance gain for larger data sets
library(microbenchmark)
library(data.table)
set.seed(123)
DF <- as.data.frame(matrix(sample(1e8, 1e5, replace = TRUE), ncol = 10))
DT <- copy(DF)
setDT(DT)
microbenchmark(unique(DF), unique(DT))
# Unit: microseconds
# expr min lq mean median uq max neval cld
# unique(DF) 44708.230 48981.8445 53062.536 51573.276 52844.591 107032.18 100 b
# unique(DT) 746.855 776.6145 2201.657 864.932 919.489 55986.88 100 a
microbenchmark(duplicated(DF), duplicated(DT))
# Unit: microseconds
# expr min lq mean median uq max neval cld
# duplicated(DF) 43786.662 44418.8005 46684.0602 44925.0230 46802.398 109550.170 100 b
# duplicated(DT) 551.982 558.2215 851.0246 639.9795 663.658 5805.243 100 a
the general answer can be for example:
df <- data.frame(rbind(c(2,9,6),c(4,6,7),c(4,6,7),c(4,6,7),c(2,9,6))))
new_df <- df[-which(duplicated(df)), ]
X1 X2 X3
1 2 9 6
2 4 6 7
You can also use dplyr
's distinct()
function! It tends to be more efficient than alternative options, especially if you have loads of observations.
distinct_data <- dplyr::distinct(yourdata)
With sqldf
:
# Example by Mehdi Nellen
a <- c(rep("A", 3), rep("B", 3), rep("C",2))
b <- c(1,1,2,4,1,1,2,2)
df <-data.frame(a,b)
Solution:
library(sqldf)
sqldf('SELECT DISTINCT * FROM df')
Output:
a b
1 A 1
2 A 2
3 B 4
4 B 1
5 C 2
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With