How do I reference the entire row when creating a new column in a data.table?

Tags: r, data.table

I have a data.table with more than 200 variables, all of them binary. I want to add a new column that counts how many entries in each row differ from a reference vector:

# Example
library(data.table)
dt = data.table(
"V1" = c(1,1,0,1,0,0,0,1,0,1,0,1,1,0,1,0),
"V2" = c(0,1,0,1,0,1,0,0,0,0,1,1,0,0,1,0),
"V3" = c(0,0,0,1,1,1,1,0,1,0,1,0,1,0,1,0),
"V4" = c(1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0),
"V5" = c(1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0)  
)

reference = c(1,1,0,1,0)

I can do that with a small for loop, such as

distance = integer(nrow(dt))          # preallocate one slot per row
for (i in seq_len(nrow(dt))) {
  distance[i] = sum(reference != dt[i, ])
}

But it's kind of slow and surely not the best way to do this. I tried:

dt[,"distance":= sum(reference != c(V1,V2,V3,V4,V5))]
dt[,"distance":= sum(reference != .SD)]

But neither works: both return the same value for every row. A solution where I don't have to type all the variable names would also be much better, as the real data.table has over 200 columns.

asked Jan 24 '19 12:01

Fino

2 Answers

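You can use sweep() with rowSums(). sweep() subtracts the matching reference value from each column, and rowSums() then counts the nonzero results in each row, i.e.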

rowSums(sweep(dt, 2, reference) != 0)
 #[1] 2 2 2 2 4 4 3 2 4 3 2 1 3 4 1 3
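To store the result as a column rather than a free-standing vector, the same idea can be wrapped in `:=`. This is only a minimal sketch and not part of the original answer: the column name `distance` and the helper variable `cols` are assumptions made here for illustration, and `.SD` is converted to a matrix explicitly so that `sweep()` works on a plain numeric array.

```r
library(data.table)

# Toy table with three binary columns and a matching reference vector
dt <- data.table(V1 = c(1, 1, 0), V2 = c(0, 1, 0), V3 = c(0, 0, 0))
reference <- c(1, 1, 0)

# Capture the column names first so a later `distance` column is never swept
cols <- names(dt)

# sweep() subtracts reference[j] from column j; nonzero entries are mismatches
dt[, distance := rowSums(sweep(as.matrix(.SD), 2, reference) != 0), .SDcols = cols]

print(dt$distance)
# 1 0 2
```

Passing `.SDcols` means no column names have to be typed by hand, which matters with 200+ columns.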

BENCHMARK

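library(data.table)
library(microbenchmark)

# Reshape to long format, then count mismatches within each row index
HUGH <- function(dt) {
    dt[, I := .I]  # adds the row index I by reference
    distance_by_I <- melt(dt, id.vars = "I")[, .(distance = sum(reference != value)), keyby = "I"]
    return(dt[distance_by_I, on = "I"])
}

# Column-wise subtraction of the reference, then count nonzero entries per row
Sotos <- function(dt) {
    return(rowSums(sweep(dt, 2, reference) != 0))
}

# 100,000 rows of random binary data in 5 columns
dt1 <- as.data.table(replicate(5, sample(c(0, 1), 100000, replace = TRUE)))
microbenchmark(HUGH(dt1), Sotos(dt1))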

#Unit: milliseconds
#       expr       min        lq      mean   median        uq       max neval cld
#  HUGH(dt1) 112.71936 117.03380 124.05758 121.6537 128.09904 155.68470   100   b
# Sotos(dt1)  23.66799  31.11618  33.84753  32.8598  34.02818  68.75044   100  a

answered Sep 21 '22 00:09

Sotos

Another option:

ref = as.list(reference)
dt[, Reduce(`+`, Map(`!=`, .SD, ref))]

How it works: Map takes each column vector in .SD and compares it to the single corresponding value in ref. The != function is vectorized, so each element of ref is recycled to match the length of its column.
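The comparison step can be seen in isolation on a tiny input. This is an illustrative sketch only; `columns` and `mismatch` are names made up here, and a plain list stands in for `.SD`.

```r
# Two short columns and one reference value per column (toy data)
columns <- list(V1 = c(1, 0, 1), V2 = c(0, 0, 1))
ref <- list(1, 0)

# Each column is compared with its own reference value; the scalar is recycled
mismatch <- Map(`!=`, columns, ref)

print(mismatch)
# V1: FALSE  TRUE FALSE
# V2: FALSE FALSE  TRUE
```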

The Map call returns a list of TRUE/FALSE vectors, one per column. When TRUE/FALSE values are added, they are treated as 1/0, so we just need to add these columns together. Reduce does this by applying the binary operator + to the first and second columns, then to that result and the third column, and so on. It might be more readable as

x = dt[, Map(`!=`, .SD, ref)]
Reduce(`+`, x, init = 0L)

which can be read as

v = 0L
for (xi in x) v = v + xi   # add each column's mismatches to the running total
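To keep the count in the table, as the question asks, the same expression can be assigned with `:=`. This is a hedged sketch rather than part of the original answer: `distance` and `binary_cols` are names chosen here, and `.SDcols` is set explicitly so that re-running the line does not fold an existing `distance` column into the comparison.

```r
library(data.table)

# Toy table with three binary columns; ref holds one value per column
dt <- data.table(V1 = c(1, 1, 0), V2 = c(0, 1, 0), V3 = c(0, 0, 0))
ref <- as.list(c(1, 1, 0))

# Names of the columns to compare, taken before `distance` is added
binary_cols <- names(dt)

# Map builds one logical mismatch vector per column; Reduce sums them row-wise
dt[, distance := Reduce(`+`, Map(`!=`, .SD, ref)), .SDcols = binary_cols]

print(dt$distance)
# 1 0 2
```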

Frank
