
How to check whether any element of a row matches any element of the row above

Tags:

r

data.table

Is there a way to check whether any value in a row is identical to any value in the row above, regardless of order? Below is an arbitrary example data.table.

library(data.table)

DT <- data.table(A=c("a","a","b","d","e","f","h","i","j"),
                 B=c("a","b","c","c","f","g",NA,"j",NA),
                 C=c("a","b","c","b","g","h",NA,NA,NA))

> DT
   A  B  C
1: a  a  a
2: a  b  b
3: b  c  c
4: d  c  b
5: e  f  g
6: f  g  h
7: h NA NA
8: i  j NA
9: j NA NA

I would like to add a column D that compares each row with the row above and flags whether the two rows share any value (regardless of order). The desired output would be:

 > DT
   A  B  C  D
1: a  a  a  0 #No row above to compare; could be either NA or 0
2: a  b  b  1 #row 2 has "a", which is in row 1; returns 1
3: b  c  c  1 #row 3 has "b", which is in row 2; returns 1
4: d  c  b  1 #row 4 has "b" and "c", which are in row 3; returns 1
5: e  f  g  0 #row 5 has nothing that is in row 4; returns 0
6: f  g  h  1 #row 6 has "f" and "g", which are in row 5; returns 1
7: h NA NA  1 #row 7 has "h", which is in row 6; returns 1
8: i  j NA  0 #row 8 has nothing that is in row 7 (NA doesn't count)
9: j NA NA  1 #row 9 has "j", which is in row 8; returns 1 (NA doesn't count)

The main idea is to compare one row (as a vector) with another row (vector) and to treat the two as matching if they have at least one element in common, without looping over the elements to compare them one by one.

wyatt asked Dec 06 '22 15:12


2 Answers

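We can do this by taking the lead (next) rows of the dataset, pasting each row into a single string, and using grepl with Map to check whether any value of the next row occurs in the pasted string of the current row. The result is then unlisted, shifted down by one row, and converted to integer.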

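DT[, D := {
     # each row pasted into one string, e.g. "a a a"
     v1 <- do.call(paste, .SD)
     # the next row's values joined with "|" to form a regex alternation
     v2 <- do.call(paste, c(shift(.SD, type = "lead"), sep="|"))
     # drop the "NA" entries (and the adjoining "|") from the pattern
     v2N <- gsub("NA\\|*|\\|*NA", "", v2)
     # does row i contain any value of row i + 1?
     v3 <- unlist(Map(grepl, v2N, v1), use.names = FALSE)
     # shift down one row so each flag sits on the lower row of the pair
     as.integer(head(c(FALSE, v3), -1))
}]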

DT
#   A  B  C D
#1: a  a  a 0
#2: a  b  b 1
#3: b  c  c 1
#4: d  c  b 1
#5: e  f  g 0
#6: f  g  h 1
#7: h NA NA 1
#8: i  j NA 0
#9: j NA NA 1
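
To see what the pattern step does, here is a minimal base R sketch (not part of the original answer) that applies it to a single pair of rows, rows 6 and 7 of the example. The variable names are made up for illustration.

```r
# Row 6 pasted into one string, as v1 does for every row
current_row <- paste("f", "g", "h")

# Row 7 joined with "|", as v2 does for every lead row; NA becomes the text "NA"
next_row <- paste("h", NA, NA, sep = "|")

# Strip the "NA" entries and their adjoining "|", as v2N does
pattern <- gsub("NA\\|*|\\|*NA", "", next_row)

# TRUE when any remaining value of row 7 occurs in the pasted row 6
grepl(pattern, current_row)
```

Because grepl matches substrings, this approach would also report a match when one value is contained in another (for example "a" inside "ab"), so it fits best when values cannot overlap in that way.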

Or we can split the table into single rows and compare each pair of adjacent rows using Map:

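as.integer(c(FALSE, unlist(Map(function(x, y) {
     x1 <- na.omit(unlist(x))   # non-NA values of the upper row
     y1 <- na.omit(unlist(y))   # non-NA values of the lower row
     # TRUE if the two rows share at least one value; a single %in% is enough
     # and avoids recycling warnings when the rows differ in length
     any(x1 %in% y1)  },
     split(DT[-nrow(DT)], 1:(nrow(DT)-1)), split(DT[-1], 2:nrow(DT))), use.names = FALSE)))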
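
The same pairwise comparison can also be sketched as a small standalone function. This is an illustration under assumptions, not code from the answer: the helper name row_overlap and its matrix input are hypothetical, and it uses only base R.

```r
# Hypothetical helper (name and signature are assumptions, not from the answers):
# for each row of a character matrix, 1 if it shares a non-NA value with the
# row above, otherwise 0. The first row has nothing above it and gets 0.
row_overlap <- function(m) {
  flags <- vapply(seq_len(nrow(m))[-1], function(i) {
    current <- m[i, ][!is.na(m[i, ])]   # non-NA values of row i
    any(current %in% m[i - 1, ])        # NA was dropped, so it cannot match
  }, logical(1))
  as.integer(c(FALSE, flags))
}

# The question's data as a plain character matrix
m <- cbind(A = c("a","a","b","d","e","f","h","i","j"),
           B = c("a","b","c","c","f","g",NA,"j",NA),
           C = c("a","b","c","b","g","h",NA,NA,NA))

row_overlap(m)
```

With data.table loaded, the result could presumably be attached with something like DT[, D := row_overlap(as.matrix(.SD)), .SDcols = c("A", "B", "C")].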
akrun answered Dec 08 '22 04:12


Here is another method. It's probably not advisable on large data.tables as it uses by=1:nrow(DT) which tends to be quite slow.

DT[, D:= sign(DT[, c(.SD, shift(.SD))][,
   sum(!is.na(intersect(unlist(.SD[, .(A, B, C)]), unlist(.SD[, .(V4, V5, V6)])))),
   by=1:nrow(DT)]$V1)]

Here, [, c(.SD, shift(.SD))] creates a copy of the data.table with the lagged variables included (cbinded) as V4, V5 and V6. The second chain then intersects the unlisted values of the original columns with those of the shifted columns. In the result, NAs count as 0 and non-NAs as 1, and these are summed. This operation occurs for each row of the copied data.table. The sum is extracted with $V1 and is turned into binary (0 and 1) using sign.

It returns

DT
   A  B  C D
1: a  a  a 0
2: a  b  b 1
3: b  c  c 1
4: d  c  b 1
5: e  f  g 0
6: f  g  h 1
7: h NA NA 1
8: i  j NA 0
9: j NA NA 1
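
The per-row step described above (intersect, ignore NA, then sign) can be tried in isolation. The following is a minimal base R sketch on two pairs of rows from the example; the helper name shared_flag is made up for illustration.

```r
# Hypothetical helper: 1 if the two vectors share a non-NA value, else 0
shared_flag <- function(row_below, row_above) {
  shared <- intersect(row_below, row_above)  # values present in both rows
  sign(sum(!is.na(shared)))                  # NA in common does not count
}

# Rows 7 and 8: the only common element is NA
shared_flag(c("i", "j", NA), c("h", NA, NA))

# Rows 6 and 7: "h" is common
shared_flag(c("h", NA, NA), c("f", "g", "h"))
```

The first call gives 0 and the second gives 1, matching rows 8 and 7 of column D.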
lmo answered Dec 08 '22 05:12