
Finding rows with a unique combination of values (R)

This is a bit more complicated than the title lets on, and I'm sure that if I could think of a better way to describe it, I could google it better.

I have data that looks like this:

SET                     ID    
100301006              1287025
100301006              1287026
100301010              1287027
100301013              1287030
100301011              1287027

and I would like to identify and select those rows where both values are unique within their respective columns. In the example above, I want to grab only the row:

100301013              1287030

I don't want SET 100301006, since it matches 2 different records in the ID field (1287025 and 1287026). Similarly, I don't want SET 100301010, since the ID record it matches (1287027) also matches another SET (100301011).

In some cases there could be more than 2 matches.

I could do this in loops, but that seems like a hack. I'd love a base R or data.table solution, but I'm not so interested in dplyr (trying to minimize dependencies).

gruvn asked Dec 22 '22 20:12

1 Answer

We can use duplicated on each column independently to create a list of logical vectors, Reduce it to a single vector with &, and use that vector to subset the rows of the dataset:

df1[Reduce(`&`, lapply(df1, function(x) 
         !(duplicated(x)|duplicated(x, fromLast = TRUE)))),]
#     SET      ID
#4 100301013 1287030
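
To see why the two duplicated calls are combined, here is a minimal sketch that builds the per-column masks one step at a time on the question's data. The helper name is_unique is hypothetical and used only for illustration; it is not part of base R.

```r
# Same data as in the question
df1 <- data.frame(
  SET = c(100301006L, 100301006L, 100301010L, 100301013L, 100301011L),
  ID  = c(1287025L, 1287026L, 1287027L, 1287030L, 1287027L)
)

# Hypothetical helper: TRUE where a value occurs exactly once in x.
# duplicated(x) flags every repeat after the first occurrence;
# duplicated(x, fromLast = TRUE) flags every repeat before the last one.
# Their union flags all members of any repeated group, which is then negated.
is_unique <- function(x) !(duplicated(x) | duplicated(x, fromLast = TRUE))

set_ok <- is_unique(df1$SET)  # FALSE FALSE  TRUE  TRUE  TRUE
id_ok  <- is_unique(df1$ID)   #  TRUE  TRUE FALSE  TRUE FALSE

# Keep rows whose value is unique in both columns
df1[set_ok & id_ok, ]
```

Only row 4 passes both masks, which matches the output shown above.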

Or, as @chinsoon12 suggested:

 m1 <- sapply(df1, function(x) !(duplicated(x)| duplicated(x, fromLast = TRUE)))
 df1[rowSums(m1) == ncol(m1),, drop = FALSE]
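
Since the question mentions that a value may have more than 2 matches, here is a hedged alternative in base R that counts occurrences per value instead of using duplicated. The data frame df2 is made up for illustration (it is not from the question), and the helper name occurs_once is hypothetical.

```r
# Made-up example data: SET 1 maps to three IDs, and ID 14 maps to two SETs
df2 <- data.frame(
  SET = c(1L, 1L, 1L, 2L, 3L, 4L),
  ID  = c(10L, 11L, 12L, 13L, 14L, 14L)
)

# Hypothetical helper: ave() returns, for each element, the size of the
# group of equal values it belongs to; a size of 1 means the value is unique.
occurs_once <- function(x) ave(seq_along(x), x, FUN = length) == 1

# Combine the per-column masks exactly as in the answer above
keep <- Reduce(`&`, lapply(df2, occurs_once))

df2[keep, , drop = FALSE]
```

In this sketch only the row with SET 2 and ID 13 is kept, because every other row shares its SET or its ID with at least one other row.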

data

df1 <- structure(list(SET = c(100301006L, 100301006L, 100301010L, 100301013L, 
100301011L), ID = c(1287025L, 1287026L, 1287027L, 1287030L, 1287027L
)), class = "data.frame", row.names = c(NA, -5L))
akrun answered Jan 19 '23 08:01