
How to merge lists of vectors based on one vector belonging to another vector?

In R, I have two data.tables that contain list columns:

library(data.table)

d1 <- data.table(
  group_id1=1:4
)
d1$Cat_grouped <- list(letters[1:2],letters[3:2],letters[3:6],letters[11:12] )

And

d_grouped <- data.table(
  group_id2=1:4
)
d_grouped$Cat_grouped <- list(letters[1:5],letters[6:10],letters[1:2],letters[1] )

I would like to merge these two data.tables based on whether the vectors in d1$Cat_grouped are contained in the vectors in d_grouped$Cat_grouped.

To be more precise, there could be two matching criteria:

a) all elements of each vector of d1$Cat_grouped must be in the matched vector of d_grouped$Cat_grouped

Resulting in the following match (for the first two groups of d_grouped):

result_a <- data.table(
  group_id1=c(1,2),
  group_id2=c(1,1)
)

b) at least one of the elements in each vector of d1$Cat_grouped must be in the matched vector of d_grouped$Cat_grouped

Resulting in the following match (for the first two groups of d_grouped):

result_b <- data.table(
  group_id1=c(1,2,3,3),
  group_id2=c(1,1,1,2)
)

How can I implement a) or b)? Preferably in a data.table way.

EDIT1: added the expected results of a) and b)

EDIT2: added more groups to d_grouped, so grouping variables overlap. This breaks some of the proposed solutions

LucasMation asked Jul 31 '17 03:07


3 Answers

So I think long form is better, though my answer feels a little roundabout. I bet someone who's a little sleeker with data.table can do this in fewer steps, but here's what I've got:

First, let's unpack the vectors in your example data:

d1_long <- d1[, list(cat=unlist(Cat_grouped)), group_id1]
d_grouped_long <- d_grouped[, list(cat=unlist(Cat_grouped)), group_id2]

Now, we can merge on the individual elements:

result_b <- merge(d1_long, d_grouped_long, by='cat')

Based on our example, it seems you don't actually need to know which elements were part of the match...

result_b[, cat := NULL]

Finally, my result has duplicated group_id pairs because the join returns one row per matching element, not one row per matching pair of vectors. So we can unique them away.

result_b <- unique(result_b)

Here's my result_b:

   group_id1 group_id2
1:          1          1
2:          2          1
3:          3          1
4:          3          2

We can use b as an intermediate step to a, since the pairs with all elements in common are a subset of the pairs with at least one element in common.

Let's merge the original tables back in to see what the candidates are in terms of subvectors and vectors:

result_a <- merge(result_b, d1, by = 'group_id1')
result_a <- merge(result_a, d_grouped, by = 'group_id2')

So now, if the length of Cat_grouped.x equals the number of its elements that are %in% Cat_grouped.y, that's a bingo.
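To make the criterion concrete, here is a minimal base R sketch on a single pair of vectors. The helper names contains_all and contains_any are hypothetical (they are not part of the original answer), and the comparison assumes non-empty vectors without NA values.

```r
# Hypothetical helpers, named here only for illustration
contains_all <- function(x, y) all(x %in% y)   # criterion a)
contains_any <- function(x, y) any(x %in% y)   # criterion b)

x <- c("c", "d", "e", "f")   # group 3 of d1
y <- letters[1:5]            # group 1 of d_grouped

contains_all(x, y)           # "f" is not in y, so criterion a) fails
contains_any(x, y)           # "c", "d" and "e" are in y, so criterion b) holds

# Counting the TRUEs and comparing with the length is the same test as contains_all
sum(x %in% y) == length(x)
```

For a non-empty x, any pair that satisfies a) also satisfies b), which is why the b) result can serve as the candidate set for a).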

I tried a handful of clean ways, but the weirdness of having lists in the data table defeated the most obvious attempts. This seems to work though:

Let's add a row column to operate by

result_a[, row := 1:.N]

Now let's get the length and number of matches...

result_a[, x.length := length(Cat_grouped.x[[1]]), row]
result_a[, matches := sum(Cat_grouped.x[[1]] %in% Cat_grouped.y[[1]]), row]

And filter down to just rows where length and matches are the same

result_a <- result_a[x.length==matches]
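As an alternative to the helper columns, the same filter can be sketched with mapply over the two list columns. This is only a sketch under assumptions: it uses the two-group version of d_grouped, and the names candidates and is_subset are made up for illustration.

```r
library(data.table)

d1 <- data.table(
  group_id1 = 1:4,
  Cat_grouped = list(letters[1:2], letters[3:2], letters[3:6], letters[11:12]))
d_grouped <- data.table(
  group_id2 = 1:2,
  Cat_grouped = list(letters[1:5], letters[6:10]))

# Candidate pairs: at least one shared element (criterion b)
d1_long <- d1[, .(cat = unlist(Cat_grouped)), by = group_id1]
d_grouped_long <- d_grouped[, .(cat = unlist(Cat_grouped)), by = group_id2]
candidates <- unique(
  merge(d1_long, d_grouped_long, by = "cat")[, .(group_id1, group_id2)])

# Bring the original list columns back (they become Cat_grouped.x / .y)
candidates <- merge(candidates, d1, by = "group_id1")
candidates <- merge(candidates, d_grouped, by = "group_id2")

# One TRUE/FALSE per row: is every element of x contained in y?
is_subset <- mapply(function(x, y) all(x %in% y),
                    candidates$Cat_grouped.x, candidates$Cat_grouped.y)

result_a <- candidates[is_subset, .(group_id1, group_id2)]
result_a
```

This avoids the row, x.length and matches columns, at the cost of an R-level loop over the candidate rows.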
HarlandMason answered Oct 10 '22 19:10


This answer focuses on part a) of the question.

It follows Harland's approach but tries to make better use of the data.table idiom for performance reasons, as the OP has mentioned that the production data may contain millions of observations.

Sample data

library(data.table)
d1 <- data.table(
  group_id1 = 1:4,
  Cat_grouped = list(letters[1:2], letters[3:2], letters[3:6], letters[11:12]))

d_grouped <- data.table(
  group_id2 = 1:2,
  Cat_grouped = list(letters[1:5], letters[6:10]))

Result a)

grp_cols <- c("group_id1", "group_id2")
unique(d1[, .(unlist(Cat_grouped), lengths(Cat_grouped)), by = group_id1][
  d_grouped[, unlist(Cat_grouped), by = group_id2], on = "V1", nomatch = 0L][
    , .(V2, .N), by = grp_cols][V2 == N, ..grp_cols])

   group_id1 group_id2
1:         1         1
2:         2         1

Explanation

While expanding the list elements of d1 and d_grouped into long format, the number of list elements is determined for d1 using the lengths() function. lengths() (note the difference from length()) returns the length of each element of a list and was introduced with R 3.2.0.

After the inner join (note the nomatch = 0L parameter), the number of rows in the result set is counted (using the special symbol .N) for each combination of grp_cols. Only those rows are kept where the count in the result set matches the original length of the list. Finally, the unique combinations of grp_cols are returned.
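The chained expression can also be unrolled into separate steps, which may be easier to follow. The sketch below uses the same sample data; the column names cat, n_cat and n_matched are illustrative choices rather than names from the answer, and it assumes no duplicated elements within any single vector (duplicates would distort the counts).

```r
library(data.table)

d1 <- data.table(
  group_id1 = 1:4,
  Cat_grouped = list(letters[1:2], letters[3:2], letters[3:6], letters[11:12]))
d_grouped <- data.table(
  group_id2 = 1:2,
  Cat_grouped = list(letters[1:5], letters[6:10]))

# length() counts the list itself, lengths() counts each of its elements
length(d1$Cat_grouped)    # number of rows
lengths(d1$Cat_grouped)   # size of each vector

# Long format, carrying each d1 vector's original size along as n_cat
d1_long <- d1[, .(cat = unlist(Cat_grouped), n_cat = lengths(Cat_grouped)),
              by = group_id1]
d_grouped_long <- d_grouped[, .(cat = unlist(Cat_grouped)), by = group_id2]

# Inner join on the individual elements
joined <- d1_long[d_grouped_long, on = "cat", nomatch = 0L]

# Count matched elements per pair; keep pairs where every element matched
counts <- joined[, .(n_matched = .N), by = .(group_id1, group_id2, n_cat)]
result_a <- counts[n_matched == n_cat, .(group_id1, group_id2)]
result_a
```

Because each pair is reduced to one row by the grouping step, the final unique() call is not needed in this variant.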

Result b)

Result b) can be derived from the above solution by omitting the counting steps:

unique(d1[, unlist(Cat_grouped), by = group_id1][
  d_grouped[, unlist(Cat_grouped), by = group_id2], on = "V1", nomatch = 0L][
      , c("group_id1", "group_id2")])

   group_id1 group_id2
1:         1         1
2:         2         1
3:         3         1
4:         3         2
Uwe answered Oct 10 '22 20:10


Another way:

Cross-join to get all pairs of group ids:

Y = CJ(group_id1=d1$group_id1, group_id2=d_grouped$group_id2)

Then merge in the vectors:

Y = Y[d1, on='group_id1'][d_grouped, on='group_id2']

#    group_id1 group_id2 Cat_grouped i.Cat_grouped
# 1:         1         1         a,b     a,b,c,d,e
# 2:         2         1         c,b     a,b,c,d,e
# 3:         3         1     c,d,e,f     a,b,c,d,e
# 4:         4         1         k,l     a,b,c,d,e
# 5:         1         2         a,b     f,g,h,i,j
# 6:         2         2         c,b     f,g,h,i,j
# 7:         3         2     c,d,e,f     f,g,h,i,j
# 8:         4         2         k,l     f,g,h,i,j

Now you can use mapply to filter however you like:

Y[mapply(function(u,v) all(u %in% v), Cat_grouped, i.Cat_grouped), 1:2]
#    group_id1 group_id2
# 1:         1         1
# 2:         2         1

Y[mapply(function(u,v) length(intersect(u,v)) > 0, Cat_grouped, i.Cat_grouped), 1:2]
#    group_id1 group_id2
# 1:         1         1
# 2:         2         1
# 3:         3         1
# 4:         3         2
sirallen answered Oct 10 '22 19:10