I have a huge dataset (> 2.5 million rows). A small subset looks like this (reproducible code):
temp <- data.frame(col1 = c("424", "560", "557"),
                   col2 = c("276", "427", "V46"),
                   col3 = c("780", "V45", "584"),
                   col4 = c("276", "V45", "995"),
                   col5 = c("428", "799", "427"))
> temp
col1 col2 col3 col4 col5
1 424 276 780 276 428
2 560 427 V45 V45 799
3 557 V46 584 995 427
I am trying to remove duplicates within each row and shift the remaining values left, using this code:
library(plyr)
temp <- apply(temp,1,function(x) unique(unlist(x)))
temp <- ldply(temp, rbind)
> temp
1 2 3 4 5
1 424 276 780 428 <NA>
2 560 427 V45 799 <NA>
3 557 V46 584 995 427
I am successful in doing this; however, when I extend the above code to my original huge dataset, I run into performance issues: because I am using apply, the code takes a long time to execute. Can I improve this?
A simpler function in apply should speed things up. We use the fact that indexing with a number larger than length(x) results in NA.
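To illustrate that indexing fact on a small vector:

x <- c("a", "b", "c")
x[1:5]
# [1] "a" "b" "c" NA  NA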
nc <- ncol(temp)
t(apply(temp, 1, function(x) unique(x)[1:nc]))
# [,1] [,2] [,3] [,4] [,5]
# [1,] "424" "276" "780" "428" NA
# [2,] "560" "427" "V45" "799" NA
# [3,] "557" "V46" "584" "995" "427"
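Note that apply returns a character matrix here; if you need a data frame again, wrap the result in as.data.frame, as is done in the benchmark below:

as.data.frame(t(apply(temp, 1, function(x) unique(x)[1:nc])))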
A data.table alternative in a similar vein, but the update is done on the data in long format instead. The benchmark below suggests that this may be slightly faster.
library(data.table)
setDT(temp)
nc <- ncol(temp)
dcast(melt(temp[, ri := seq_len(.N)], id.var = "ri")[
, value := unique(value)[1:nc], by = ri], ri ~ variable)[ , ri := NULL][]
# col1 col2 col3 col4 col5
# 1: 424 276 780 428 NA
# 2: 560 427 V45 799 NA
# 3: 557 V46 584 995 427
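To make the long format concrete, here is a sketch of the intermediate melt step on a copy of the three-row example; ri is the per-row index, and melt stacks col1 through col5 on top of each other:

m <- melt(as.data.table(temp)[, ri := seq_len(.N)], id.var = "ri")
head(m, 6)
#    ri variable value
# 1:  1     col1   424
# 2:  2     col1   560
# 3:  3     col1   557
# 4:  1     col2   276
# 5:  2     col2   427
# 6:  3     col2   V46

The value column is then overwritten group-wise by ri, and dcast reshapes the result back to wide format.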
Benchmark on data of the size mentioned in the OP. In a benchmark by F. Privé on a 1e5-row data set, the OP's alternative was slower, so it is not included here.
temp <- temp[sample(nrow(temp), size = 3e6, replace = TRUE), ]
microbenchmark::microbenchmark(
  privefl = {
    p <- ncol(temp)
    myf <- compiler::cmpfun(
      function(x) {
        un <- unique(x)
        d <- p - length(un)
        if (d > 0) {
          un <- c(un, rep(NA_character_, d))
        }
        un
      }
    )
    as.data.frame(t(apply(t(temp), 2, myf)))
  },
  h1 = {
    nc <- ncol(temp)
    as.data.frame(t(apply(temp, 1, function(x) unique(x)[1:nc])))
  },
  h2 = {
    d <- as.data.table(temp)
    nc <- ncol(d)
    dcast(melt(d[, ri := seq_len(.N)], id.var = "ri")[
      , value := unique(value)[1:nc], by = ri], ri ~ variable)[, ri := NULL]
  },
  times = 20, unit = "relative")
# Unit: relative
# expr min lq mean median uq max neval cld
# privefl 1.312071 1.342116 1.341450 1.354268 1.403343 1.243641 20 b
# h1 1.227693 1.270512 1.270115 1.332642 1.301049 1.156123 20 b
# h2 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 20 a
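As a quick sanity check (not part of the original benchmark), you can confirm on a fresh copy of the three-row example that the base-R and data.table approaches return the same values:

small <- data.frame(col1 = c("424", "560", "557"),
                    col2 = c("276", "427", "V46"),
                    col3 = c("780", "V45", "584"),
                    col4 = c("276", "V45", "995"),
                    col5 = c("428", "799", "427"),
                    stringsAsFactors = FALSE)
nc <- ncol(small)
h1 <- t(apply(small, 1, function(x) unique(x)[1:nc]))
d <- as.data.table(small)
h2 <- dcast(melt(d[, ri := seq_len(.N)], id.var = "ri")[
  , value := unique(value)[1:nc], by = ri], ri ~ variable)[, ri := NULL]
all.equal(unname(h1), unname(as.matrix(h2)))
# [1] TRUE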