Subset and join a data frame by matching on nested list in R

I'm attempting to join two data frames, df and myData, according to elements in a column from each. The column in df purposefully contains nested lists, and I would like to join if an element in the nested list matches an element of myData. I'd like to keep unmatched rows in df (left join).

Here is an example, first without nested lists in df.

df = data.frame(a=1:5)
df$x1= c("a", "b", "g", "a", "a")
str(df)

'data.frame':   5 obs. of  2 variables:
$ a : int  1 2 3 4 5
$ x1: chr  "a" "b" "g" "a" ...

myData <- data.frame(x1=c("a", "g", "q"), x2= c("za", "zg", "zq"), stringsAsFactors = FALSE)

Now, we can join on column x1:

#using a for loop
df$x2 <- NA
for(id in 1:nrow(myData)){
  df$x2[df$x1 %in% myData$x1[id]] <- myData$x2[id]
}

Or using dplyr:

library(dplyr)
df = data.frame(a=1:5)
df$x1= c("a", "b", "g", "a", "a")
df %>%
  left_join(myData)

Now, consider df with nested lists.

l1 = list(letters[1:5])
l2 = list(letters[6:10])
df = data.frame(a=1:5)
df$x1= c("a", "b", "g", l1, l2)

Using a for loop fails to match on elements of a nested list, as we expect:

df$x2 <- NA
for(id in 1:nrow(myData)){
  df$x2[df$x1 %in% myData$x1[id]] <- myData$x2[id]
}

output:

df
  a            x1   x2
1 1             a   za
2 2             b <NA>
3 3             g   zg
4 4 a, b, c, d, e <NA>
5 5 f, g, h, i, j <NA>

Using dplyr:

df %>%
  left_join(myData)

throws an error:

Joining by: c("x1", "x2")
Error: cannot join on column 'x1'

I think the solution needs to unlist the nested lists, but haven't sorted out how to work the unlist function into the above strategies.

I've also tried the above with the data.table package. How to accomplish this with data.table may be a separate question, but to the extent that data.table handles lists within data frames, I wanted to mention it, as it may provide the best solution.

My actual data is about 100,000 rows, so matching on lists with base R could be a performance annoyance (another reason to consider data.table?).

Fwiw, the use of nested lists (and other structures) within data frames is something I would often do in Python, and it may be there is a better way to structure the data in the first place in R.

Thoughts?

bassounds asked Mar 19 '23 14:03


1 Answer

Here is a possible solution:

df$x2 <- NA
for(id in 1:nrow(df)) 
  {
  df$x2[id] <- ifelse(
    length(ff <- myData$x2[which(myData$x1 == intersect(df$x1[[id]], myData$x1))])==0, 
    NA, 
    ff)
  }

df
#  a            x1   x2
#1 1             a   za
#2 2             b <NA>
#3 3             g   zg
#4 4 a, b, c, d, e   za
#5 5 f, g, h, i, j   zg

There are some potential pitfalls with the above solution. For example, if we change l1 to have 2 possible matches (e.g. "a" and "g") :

l1 = list(letters[1:7])
df$x1= c("a", "b", "g", l1, l2)

This solution will not catch both matches, as is:

df$x2 <- NA
for(id in 1:nrow(df)) 
  {
  df$x2[id] <- ifelse(
    length(ff <- myData$x2[which(myData$x1 == intersect(df$x1[[id]], myData$x1))])==0, 
    NA, 
    ff)
  }
Warning message:
In myData$x1 == intersect(df$x1[[id]], myData$x1) :
  longer object length is not a multiple of shorter object length
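
The warning comes from `==` recycling the shorter vector element by element instead of testing membership. A minimal sketch of the difference, using illustrative vectors (`lookup` and `hits` are hypothetical names standing in for `myData$x1` and the result of `intersect()`):

```r
# Stand-ins for myData$x1 and for an intersect() result with two matches
lookup <- c("a", "g", "q")
hits   <- c("g", "a")

# `==` compares position by position and recycles `hits`, so the lengths
# clash (hence the warning) and here no position happens to line up
res_eq <- suppressWarnings(lookup == hits)

# %in% asks, for each element of `lookup`, whether it occurs anywhere in `hits`
res_in <- lookup %in% hits

print(res_eq)
print(res_in)
```

With `%in%` the order and number of matches no longer matter, which is why the variants that allow multiple matches switch to it.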

You could modify it to allow multiple matches, if needed. Here are two ways to do that: one collapses the matches into a single string with paste, and the other stores them in a list, as you did in the question.

df$x2 <- NA
for(id in 1:nrow(df)) 
  {
  df$x2[id] <- 
    paste(if (length(ff <- myData$x2[which(myData$x1 %in% intersect(df$x1[[id]], myData$x1))])==0)
    NA else
    ff, collapse=", ")
  }

df$x2 <- NA
for(id in 1:nrow(df)) 
  {
  df$x2[id] <- 
    list(if (length(ff <- myData$x2[which(myData$x1 %in% intersect(df$x1[[id]], myData$x1))])==0)
    NA else
    ff)
  }

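Both will print the following, but the underlying structure will be different: the paste version gives a character column (where an unmatched row holds the string "NA" rather than a true missing value), while the list version gives a list column.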

  a                  x1     x2
1 1                   a     za
2 2                   b     NA
3 3                   g     zg
4 4 a, b, c, d, e, f, g za, zg
5 5       f, g, h, i, j     zg
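
If the explicit loop is a concern at around 100,000 rows, the same matching can be written with lapply and vapply. This is only a sketch of one alternative, not a benchmarked solution; it rebuilds the example data from the question, and the intermediate name `matches` is introduced here for illustration. Unlike the paste version above, unmatched rows get a true NA.

```r
# Rebuild the example data from the question (l1 extended to letters[1:7])
myData <- data.frame(x1 = c("a", "g", "q"),
                     x2 = c("za", "zg", "zq"),
                     stringsAsFactors = FALSE)
df <- data.frame(a = 1:5)
df$x1 <- c("a", "b", "g", list(letters[1:7]), list(letters[6:10]))

# For each row, collect every x2 whose x1 occurs in that row's x1 element
matches <- lapply(df$x1, function(v) myData$x2[myData$x1 %in% v])

# Collapse to one string per row; rows without a match get NA_character_
df$x2 <- vapply(
  matches,
  function(m) if (length(m) == 0) NA_character_ else paste(m, collapse = ", "),
  character(1)
)

print(df)
```

Assigning `matches` directly (`df$x2 <- matches`) would keep a list column instead, at the cost of zero-length entries for unmatched rows.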
Jota answered Apr 06 '23 03:04