Suppose I have a data.table like this:
Table:
V1 V2
A B
C D
C A
B A
D C
I want each row to be regarded as a set, which means that B A and A B are the same. So after the process, I want to get:
V1 V2
A B
C D
C A
In order to do that, I have to first sort the table row by row and then use unique to remove the duplicates. The sorting step is quite slow when there are millions of rows, so is there an easy way to remove the duplicates without sorting?
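Roughly, the row-by-row sort followed by unique that I mean looks like this (a minimal sketch on the example data above; note that it keeps the sorted orientation, so the third row comes back as A C rather than C A):
library(data.table)
dt <- data.table(V1 = c("A", "C", "C", "B", "D"),
                 V2 = c("B", "D", "A", "A", "C"))
# sort each row, then drop duplicate rows -- simple, but slow for millions of rows
sorted <- as.data.table(t(apply(dt, 1, sort)))
setnames(sorted, names(dt))
unique(sorted)
#    V1 V2
# 1:  A  B
# 2:  C  D
# 3:  A  C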
For just two columns you can use the following trick:
dt = data.table(a = letters[1:5], b = letters[5:1])
# a b
#1: a e
#2: b d
#3: c c
#4: d b
#5: e a
dt[dt[, .I[1], by = list(pmin(a, b), pmax(a, b))]$V1]
# a b
#1: a e
#2: b d
#3: c c
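The trick works because pmin(a, b) and pmax(a, b) produce the same ordered pair for a row and its reversed twin, so grouping by them puts, say, b d and d b into the same group. .I[1] then returns the row number of the first row in each group, and indexing dt by those row numbers keeps the original rows in their original orientation (which is why the question's C A, rather than A C, would survive).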
Borrowing (probably unrealistic) data from a dupe:
library(data.table)
size <- 118000000
key1 <- sample( LETTERS, size, replace=TRUE, prob=runif(length(LETTERS), 0.0, 5.0) )
key2 <- sample( LETTERS, size, replace=TRUE, prob=runif(length(LETTERS), 0.0, 5.0) )
val <- runif(size, 0.0, 5.0)
dt <- data.table(key1, key2, val, stringsAsFactors=FALSE)
Here's a fast way if your data looks like this:
# eddi's answer
system.time(res1 <- dt[dt[, .I[1], by=.(pmin(key1, key2), pmax(key1, key2))]$V1])
# user system elapsed
# 101.79 3.01 107.98
# optimized for this data
system.time({
  # first pass: drop exact duplicates, then swap each remaining row so that key1 <= key2
  dt2 <- unique(dt, by=c("key1", "key2"))[key1 > key2, c("key1", "key2") := .(key2, key1)]
  # second pass: drop the rows that were duplicated across the two orderings
  res2 <- unique(dt2, by=c("key1", "key2"))
})
# user system elapsed
# 8.50 1.16 4.93
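The speed difference comes mostly from the fact that with only 26 possible letters per key, the first unique() collapses the 118 million rows down to at most 26 * 26 = 676 distinct ordered pairs; the swap and the second unique() then operate on that tiny table. The pmin()/pmax() approach, by contrast, has to evaluate both functions across all 118 million rows before any grouping can happen.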
fsetequal(copy(res1)[key1 > key2, c("key1", "key2") := .(key2, key1)], res2)
# [1] TRUE
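The copy()[key1 > key2, ...] step just puts res1 into the same canonical key1 <= key2 orientation as res2, so that fsetequal() compares like with like; copy() is there because := would otherwise modify res1 by reference.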
Data like this seems unlikely if it pertains to covariances, since you should have at most one duplicate (ie, A-B with B-A).