
match / find rows based on multiple required values in a single row in R

Tags: dataframe, r

This must be a duplicate but I can't find it. So here goes.

I have a data.frame with two columns. One contains a group and the other contains a criterion. A group can contain many different criteria, but only one per row. I want to identify groups that contain three specific criteria (which will appear on different rows). In my case, I want to identify all groups that contain the criteria "I", "E", and "C". Groups may contain any number and combination of these and several other letters.

test <- data.frame(grp=c(1,1,2,2,2,3,3,3,4,4,4,4,4),val=c("C","I","E","I","C","E","I","A","C","I","E","E","A"))

> test
  grp val
1    1   C
2    1   I
3    2   E
4    2   I
5    2   C
6    3   E
7    3   I
8    3   A
9    4   C
10   4   I
11   4   E
12   4   E
13   4   A

In the above example, I want to identify grp 2 and 4, because each of these contains the letters E, I, and C.

Thanks!

Jordan asked Sep 23 '22 03:09


2 Answers

Here's a dplyr solution. %in% is vectorized, so c("E", "I", "C") %in% val returns a logical vector of length three, one element per required value. For the target groups every element is TRUE, so passing that vector to all() returns TRUE. That's our filter, and we run it within each group using group_by().

library(dplyr)
test %>% 
  group_by(grp) %>%
  filter(all(c("E", "I", "C") %in% val))
# Source: local data frame [8 x 2]
# Groups: grp [2]
# 
#     grp    val
#   (dbl) (fctr)
# 1     2      E
# 2     2      I
# 3     2      C
# 4     4      C
# 5     4      I
# 6     4      E
# 7     4      E
# 8     4      A

Or if this output would be handier (thanks @Frank),

test %>%
  group_by(grp) %>%
  summarise(matching = all(c("E", "I", "C") %in% val))
# Source: local data frame [4 x 2]
# 
#     grp matching
#   (dbl)    (lgl)
# 1     1    FALSE
# 2     2     TRUE
# 3     3    FALSE
# 4     4     TRUE
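
To see why the filter works, it may help to evaluate the condition by hand outside of dplyr. The following is a minimal base R sketch, not part of the original answer; the names required, in_grp3, has_all and matching_groups are made up for illustration.

```r
# Same sample data as in the question
test <- data.frame(grp = c(1,1,2,2,2,3,3,3,4,4,4,4,4),
                   val = c("C","I","E","I","C","E","I","A","C","I","E","E","A"))

required <- c("E", "I", "C")  # hypothetical name for the required criteria

# One logical per required value: is it present in group 3?
in_grp3 <- required %in% test$val[test$grp == 3]
in_grp3        # TRUE TRUE FALSE ("C" is missing from group 3)
all(in_grp3)   # FALSE, so group 3 would be filtered out

# Run the same check for every group by splitting val on grp
has_all <- sapply(split(test$val, test$grp), function(v) all(required %in% v))
matching_groups <- names(has_all)[has_all]
matching_groups  # "2" "4"
```

Note that split() names its pieces after the group values, so matching_groups comes back as character; wrap it in as.numeric() if you need to compare it against the numeric grp column.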
effel answered Sep 28 '22 07:09


library(data.table)

test <- data.frame(grp=c(1,1,2,2,2,3,3,3,4,4,4,4,4),val=c("C","I","E","I","C","E","I","A","C","I","E","E","A"))

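setDT(test)  # convert the data.frame into a data.table (by reference)
# Count how often each val occurs per grp: one row per grp, one column per val.
# Naming fun.aggregate and value.var explicitly avoids the messages dcast prints when it has to guess them.
group.counts <- dcast(test, grp ~ val, fun.aggregate = length, value.var = "val")
group.counts[I > 0 & E > 0 & C > 0]  # now filtering is easy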

Results in:

   grp A C E I
1:   2 0 1 1 1
2:   4 1 1 2 1

Instead of returning only the group numbers, you could also join the matching groups back to the original data to show the raw rows of each group that matches:

test[group.counts[I>0 & E>0 & C>0,], .SD, on="grp" ]

This shows:

   grp val
1:   2   E
2:   2   I
3:   2   C
4:   4   C
5:   4   I
6:   4   E
7:   4   E
8:   4   A

PS: To make the solution easier to follow, here are the counts for all groups:

> group.counts
   grp A C E I
1:   1 0 1 0 1
2:   2 0 1 1 1
3:   3 1 0 1 1
4:   4 1 1 2 1
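
The same count-then-filter idea can also be sketched in base R with table(), which may be useful when the required values are held in a vector rather than typed out as column names. This is only an illustration and not part of the original answer: the names required, counts, keep, matching_groups and result are hypothetical, and the sketch assumes every required value occurs at least once somewhere in the data (otherwise the column lookup fails).

```r
# Same sample data as in the question
test <- data.frame(grp = c(1,1,2,2,2,3,3,3,4,4,4,4,4),
                   val = c("C","I","E","I","C","E","I","A","C","I","E","E","A"))

required <- c("I", "E", "C")  # hypothetical name for the required criteria

# Contingency table: one row per grp, one column per val
counts <- table(test$grp, test$val)

# A group matches when each required column has a count above zero
keep <- rowSums(counts[, required] > 0) == length(required)
matching_groups <- rownames(counts)[keep]
matching_groups  # "2" "4"

# Pull the raw rows of the matching groups from the original data
result <- test[test$grp %in% as.numeric(matching_groups), ]
```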
R Yoda answered Sep 28 '22 07:09