I have a data.table with more than 200 variables, all of which are binary. I want to create a new column that counts, for each row, the number of positions where the row differs from a reference vector:
# Example
library(data.table)
dt = data.table(
"V1" = c(1,1,0,1,0,0,0,1,0,1,0,1,1,0,1,0),
"V2" = c(0,1,0,1,0,1,0,0,0,0,1,1,0,0,1,0),
"V3" = c(0,0,0,1,1,1,1,0,1,0,1,0,1,0,1,0),
"V4" = c(1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0),
"V5" = c(1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0)
)
reference = c(1,1,0,1,0)
I can do that with a small for loop, such as
distance = NULL
for(i in 1:nrow(dt)){
distance[i] = sum(reference != dt[i,])
}
But it's kind of slow and surely not the best way to do this. I tried:
dt[,"distance":= sum(reference != c(V1,V2,V3,V4,V5))]
dt[,"distance":= sum(reference != .SD)]
But neither works, as they return the same value for all rows. Also, a solution where I don't have to type all the variable names would be much better, as the real data.table has over 200 columns.
You can use sweep() with rowSums(), i.e.
rowSums(sweep(dt, 2, reference) != 0)
#[1] 2 2 2 2 4 4 3 2 4 3 2 1 3 4 1 3
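If the goal is to store this as a new column (as in the question), the same expression can be assigned with :=; a minimal sketch, assuming dt still holds only the original V1-V5 columns:
dt[, distance := rowSums(sweep(.SD, 2, reference) != 0)]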
BENCHMARK
HUGH <- function(dt) {
# add a row id, melt to long format, then count mismatches against reference within each row id
dt[, I := .I]
distance_by_I <- melt(dt, id.vars = "I")[, .(distance = sum(reference != value)), keyby = "I"]
# join the per-row distances back onto dt
return(dt[distance_by_I, on = "I"])
}
Sotos <- function(dt) {
return(rowSums(sweep(dt, 2, reference) != 0))
}
dt1 <- as.data.table(replicate(5, sample(c(0, 1), 100000, replace = TRUE)))
library(microbenchmark)
microbenchmark(HUGH(dt1), Sotos(dt1))
#Unit: milliseconds
# expr min lq mean median uq max neval cld
# HUGH(dt1) 112.71936 117.03380 124.05758 121.6537 128.09904 155.68470 100 b
# Sotos(dt1) 23.66799 31.11618 33.84753 32.8598 34.02818 68.75044 100 a
Another:
ref = as.list(reference)
dt[, Reduce(`+`, Map(`!=`, .SD, ref))]
How it works. So we're taking each vector column in .SD and comparing it to the single corresponding value in ref. The != function is vectorized, so each element of ref is recycled out to match the length of each column. This Map call returns a list of TRUE/FALSE vectors, one for each column. When we add up TRUE/FALSE values, they are treated as 1/0, so we just need to add these columns up. This can be achieved by passing the pairwise operator + between the first column and the second; and then again between the result of that computation and the third column; and so on. This is how Reduce works. It might be more readable as
x = dt[, Map(`!=`, .SD, ref)]
Reduce(`+`, x, init = 0L)
See also ?Map and ?Reduce.
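To make the Reduce step concrete, here is a small hand-worked sketch on the example data (the name cmp is just illustrative, and it assumes dt still holds only the original V1-V5 columns):
cmp = Map(`!=`, as.list(dt), ref)  # one logical vector per column: TRUE where that column differs from its reference value
# Reduce folds them pairwise: ((cmp$V1 + cmp$V2) + cmp$V3) + cmp$V4 + cmp$V5
Reduce(`+`, cmp, init = 0L)
#[1] 2 2 2 2 4 4 3 2 4 3 2 1 3 4 1 3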
Timings. I'm modifying the benchmark data, since using integers seems a lot saner if the OP really has 0-1 data. Also, adding more columns since the OP says they have a lot. Finally, editing Hugh's answer to be comparable to the others:
HUGH <- function(dt, r) {
dt[, I := .I]
res <- melt(dt, id.vars = "I")[, .(distance = sum(r != value)), keyby = "I"]$distance
dt[, I := NULL]
res
}
Sotos <- function(dt, r) {
return(rowSums(sweep(dt, 2, r) != 0))
}
mm <- function(dt, r){
colSums(t(dt) != r)
}
ff <- function(DT, r){
DT[, Reduce(`+`, Map(`!=`, .SD, r))]
}
nr = 20000
nc = 500
dt1 <- as.data.table(replicate(nc, sample(0:1, nr, replace = TRUE)))
ref <- rep(as.integer(reference), length.out=nc)
lref = as.list(ref)
identical(HUGH(dt1, ref), ff(dt1, lref)) # integer output
identical(mm(dt1, ref), Sotos(dt1, ref)) # numeric output
all.equal(HUGH(dt1, ref), mm(dt1, ref)) # but they match
# all TRUE
microbenchmark::microbenchmark(times = 3,
HUGH(dt1, ref),
Sotos(dt1, ref),
mm(dt1, ref),
ff(dt1, lref)
)
Result:
Unit: milliseconds
expr min lq mean median uq max neval
HUGH(dt1, ref) 365.0529 370.05233 378.8826 375.0517 385.79737 396.5430 3
Sotos(dt1, ref) 871.5693 926.50462 961.5527 981.4400 1006.54437 1031.6488 3
mm(dt1, ref) 104.5631 121.74086 131.7157 138.9186 145.29197 151.6653 3
ff(dt1, lref) 87.0800 87.48975 93.1361 87.8995 96.16415 104.4288 3