In R, I have two data.tables that contain list columns:
d1 <- data.table(
group_id1=1:4
)
d1$Cat_grouped <- list(letters[1:2],letters[3:2],letters[3:6],letters[11:12] )
And
d_grouped <- data.table(
group_id2=1:4
)
d_grouped$Cat_grouped <- list(letters[1:5],letters[6:10],letters[1:2],letters[1] )
I would like to merge these two data.tables based on the vectors in d1$Cat_grouped
being contained in the vectors in d_grouped$Cat_grouped.
To be more precise, there could be two matching criteria:
a) all elements of each vector of d1$Cat_grouped
must be in the matched vector of d_grouped$Cat_grouped
Resulting in the following match:
result_a <- data.table(
group_id1=c(1,2),
group_id2=c(1,1)
)
b) at least one of the elements in each vector of d1$Cat_grouped
must be in the matched vector of d_grouped$Cat_grouped
Resulting in the following match:
result_b <- data.table(
group_id1=c(1,2,3,3),
group_id2=c(1,1,1,2)
)
How can I implement a) or b)? Preferably in a data.table way.
EDIT1: added the expected results of a) and b)
EDIT2: added more groups to d_grouped, so grouping variables overlap. This breaks some of the proposed solutions
So I think long form is better, though my answer feels a little roundabout. I bet someone who's a little sleeker with data.table can do this in fewer steps, but here's what I've got:
First, let's unpack the vectors in your example data:
d1_long <- d1[, list(cat=unlist(Cat_grouped)), group_id1]
d_grouped_long <- d_grouped[, list(cat=unlist(Cat_grouped)), group_id2]
Now, we can merge on the individual elements:
result_b <- merge(d1_long, d_grouped_long, by='cat')
Based on our example, it seems you don't actually need to know which elements were part of the match...
result_b[, cat := NULL]
Finally, my answer has duplicated group_id pairs because it gets a join for each pairwise match, not just the vector-level matches. So we can unique them away.
result_b <- unique(result_b)
Here's my result_b:
group_id1 group_id2
1: 1 1
2: 2 1
3: 3 1
4: 3 2
We can use b as an intermediate step to a, since the pairs where all elements match are a subset of the pairs where at least one element matches.
Let's merge the original tables to see what the candidates are in terms of subvectors and vectors
result_a <- merge(result_b, d1, by = 'group_id1')
result_a <- merge(result_a, d_grouped, by = 'group_id2')
So now, if the length of Cat_grouped.x equals the number of its elements that are %in% Cat_grouped.y, that's a bingo.
I tried a handful of clean ways, but the weirdness of having lists in the data table defeated the most obvious attempts. This seems to work though:
Let's add a row column to operate by
result_a[, row := 1:.N]
Now let's get the length and number of matches...
result_a[, x.length := length(Cat_grouped.x[[1]]), row]
result_a[, matches := sum(Cat_grouped.x[[1]] %in% Cat_grouped.y[[1]]), row]
And filter down to just rows where length and matches are the same
result_a <- result_a[x.length==matches]
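As a quick sanity check of the filter above, here is a minimal base R sketch on plain vectors (no data.table involved). The helper names count_test and all_test are made up for illustration; the point is only that comparing the length with the number of matches should give the same answer as all(x %in% y).

```r
# Base R only; count_test and all_test are hypothetical helper names.
count_test <- function(x, y) length(x) == sum(x %in% y)  # the length-vs-matches check
all_test   <- function(x, y) all(x %in% y)               # the direct "all contained" check

y <- letters[1:5]             # a, b, c, d, e

count_test(c("c", "b"), y)    # c and b are both in y
all_test(c("c", "b"), y)

count_test(letters[3:6], y)   # f is not in y, so only 3 of 4 elements match
all_test(letters[3:6], y)
```

The first pair of calls should agree on TRUE and the second pair on FALSE, so the row-wise filter is doing the same job as all().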
This answer focuses on part a) of the question.
It follows Harland's approach but tries to make better use of the data.table
idiom for performance reasons as the OP has mentioned that his production data may contain millions of observations.
library(data.table)
d1 <- data.table(
group_id1 = 1:4,
Cat_grouped = list(letters[1:2], letters[3:2], letters[3:6], letters[11:12]))
d_grouped <- data.table(
group_id2 = 1:2,
Cat_grouped = list(letters[1:5], letters[6:10]))
grp_cols <- c("group_id1", "group_id2")
unique(d1[, .(unlist(Cat_grouped), lengths(Cat_grouped)), by = group_id1][
d_grouped[, unlist(Cat_grouped), by = group_id2], on = "V1", nomatch = 0L][
, .(V2, .N), by = grp_cols][V2 == N, ..grp_cols])
group_id1 group_id2
1: 1 1
2: 2 1
While expanding the list elements of d1 and d_grouped into long format, the number of list elements is determined for d1 using the lengths() function. lengths() (note the difference to length()) gets the length of each element of a list and was introduced with R 3.2.0.
After the inner join (note the nomatch = 0L parameter), the number of rows in the result set is counted (using the special symbol .N) for each combination of grp_cols. Only those rows are considered where the count in the result set matches the original length of the list. Finally, the unique combinations of grp_cols are returned.
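To make the length() versus lengths() distinction concrete, here is a tiny base R sketch. The variable name cats is just an illustration and not part of the solution above.

```r
# A list with two character vectors of different sizes (illustrative data)
cats <- list(letters[1:2], letters[3:6])

n_elements <- length(cats)    # number of list elements
per_element <- lengths(cats)  # length of each element, as an integer vector

n_elements
per_element
```

Here length() should report 2 (the list has two elements), while lengths() should report 2 and 4 (the sizes of the individual vectors), which is the per-row count the solution needs.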
Result b) can be derived from the above solution by omitting the counting steps:
unique(d1[, unlist(Cat_grouped), by = group_id1][
d_grouped[, unlist(Cat_grouped), by = group_id2], on = "V1", nomatch = 0L][
, c("group_id1", "group_id2")])
group_id1 group_id2
1: 1 1
2: 2 1
3: 3 1
4: 3 2
Another way:
Cross-join to get all pairs of group ids:
Y = CJ(group_id1=d1$group_id1, group_id2=d_grouped$group_id2)
Then merge in the vectors:
Y = Y[d1, on='group_id1'][d_grouped, on='group_id2']
# group_id1 group_id2 Cat_grouped i.Cat_grouped
# 1: 1 1 a,b a,b,c,d,e
# 2: 2 1 c,b a,b,c,d,e
# 3: 3 1 c,d,e,f a,b,c,d,e
# 4: 4 1 k,l a,b,c,d,e
# 5: 1 2 a,b f,g,h,i,j
# 6: 2 2 c,b f,g,h,i,j
# 7: 3 2 c,d,e,f f,g,h,i,j
# 8: 4 2 k,l f,g,h,i,j
Now you can use mapply to filter however you like:
Y[mapply(function(u,v) all(u %in% v), Cat_grouped, i.Cat_grouped), 1:2]
# group_id1 group_id2
# 1: 1 1
# 2: 2 1
Y[mapply(function(u,v) length(intersect(u,v)) > 0, Cat_grouped, i.Cat_grouped), 1:2]
# group_id1 group_id2
# 1: 1 1
# 2: 2 1
# 3: 3 1
# 4: 3 2
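If you want to reuse this, the steps could be wrapped up roughly like the sketch below. The function name match_groups and its pred argument are my own invention, and it assumes the data.table package is installed and that the inputs use the same column names as above (with the two-row d_grouped shown in the cross-join output).

```r
library(data.table)

d1 <- data.table(
  group_id1 = 1:4,
  Cat_grouped = list(letters[1:2], letters[3:2], letters[3:6], letters[11:12]))
d_grouped <- data.table(
  group_id2 = 1:2,
  Cat_grouped = list(letters[1:5], letters[6:10]))

# Hypothetical helper: pred(u, v) must return a single TRUE/FALSE per pair
match_groups <- function(x, y, pred) {
  # all pairs of group ids
  pairs <- CJ(group_id1 = x$group_id1, group_id2 = y$group_id2)
  # attach both list columns; the one from y shows up as i.Cat_grouped
  pairs <- pairs[x, on = "group_id1"][y, on = "group_id2"]
  # evaluate the predicate pair by pair
  keep <- mapply(pred, pairs$Cat_grouped, pairs$i.Cat_grouped)
  pairs[keep, .(group_id1, group_id2)]
}

res_a <- match_groups(d1, d_grouped, function(u, v) all(u %in% v))  # criterion a)
res_b <- match_groups(d1, d_grouped, function(u, v) any(u %in% v))  # criterion b)
```

Keep in mind the cross join grows with the product of the two tables' row counts, so this is probably only practical for small inputs.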