Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to identify and tally intersection items in R

I have a data frame which shows membership in three color classes. Numbers refer to unique IDs. One ID may be a part of one group or multiple groups.

dat <- data.frame(BLUE = c(1, 2, 3, 4, 6, NA),
                  RED = c(2, 3, 6, 7, 9, 13),
                  GREEN = c(4, 6, 8, 9, 10, 11))

or for visual reference:

BLUE  RED  GREEN
1     2    4
2     3    6
3     6    8
4     7    9
6     9    10
NA    13   11

I need to identify and tally individual and cross group membership (i.e. how many IDs were only in red, how many were in both red and blue, etc.) My desired output is below. Please note that the IDs column is simply for reference, that column would not be in the expected output.

COLOR                TOTAL  IDs (reference only, not needed in final output)
RED                  2      (7, 13)
BLUE                 1      (1)
GREEN                3      (8, 10, 11)
RED, BLUE            3      (2, 3, 6)
RED, GREEN           2      (6, 9)
BLUE, GREEN          2      (4, 6)
RED, BLUE, GREEN     1      (6)

Does anyone know an efficient way to do this in R? Thanks!

like image 450
DJC Avatar asked Jan 25 '23 18:01

DJC


1 Answers

You can use the venn library (especially suited for situations when you do not have NAs in your data):

venn_table <- venn(as.list(dat))

               BLUE RED GREEN counts
                  0   0     0      0
GREEN             0   0     1      3
RED               0   1     0      2
RED:GREEN         0   1     1      1
BLUE              1   0     0      2
BLUE:GREEN        1   0     1      1
BLUE:RED          1   1     0      2
BLUE:RED:GREEN    1   1     1      1

And:

attr(venn_table, "intersections")

$GREEN
[1]  8 10 11

$RED
[1]  7 13

$`RED:GREEN`
[1] 9

$BLUE
[1]  1 NA

$`BLUE:GREEN`
[1] 4

$`BLUE:RED`
[1] 2 3

$`BLUE:RED:GREEN`
[1] 6

To include also the IDs:

data.frame(venn_table[2:nrow(venn_table), ],
           ID = do.call("rbind", lapply(attr(venn_table, "intersections"), paste0, collapse = ",")))

               BLUE RED GREEN counts      ID
GREEN             0   0     1      3 8,10,11
RED               0   1     0      2    7,13
RED:GREEN         0   1     1      1       9
BLUE              1   0     0      2    1,NA
BLUE:GREEN        1   0     1      1       4
BLUE:RED          1   1     0      2     2,3
BLUE:RED:GREEN    1   1     1      1       6

One way to deal with the the NAs:

venn_table2 <- data.frame(venn_table[2:nrow(venn_table), length(venn_table), drop = FALSE],
                          ID = do.call("rbind", lapply(attr(venn_table, "intersections"), paste0, collapse = ",")))

counts <- venn_table2[1] - with(venn_table2, lengths(regmatches(ID, gregexpr("NA", ID))))

               counts
GREEN               3
RED                 2
RED:GREEN           1
BLUE                1
BLUE:GREEN          1
BLUE:RED            2
BLUE:RED:GREEN      1

And a more elegant way to deal with the NAs could be (based on a comment from @M--):

print(venn(Map(function(x) x[!is.na(x)], as.list(dat))))

               BLUE RED GREEN counts
                  0   0     0      0
GREEN             0   0     1      3
RED               0   1     0      2
RED:GREEN         0   1     1      1
BLUE              1   0     0      1
BLUE:GREEN        1   0     1      1
BLUE:RED          1   1     0      2
BLUE:RED:GREEN    1   1     1      1
like image 62
tmfmnk Avatar answered Feb 11 '23 23:02

tmfmnk