Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Find if list in data.table contains word in other column

Tags:

r

data.table

I have a data.table. One of the columns is a list of words. I want to see if any of those words appear in another column, which is a single word, for each row. I feel like this should be easy but I am not getting the result I expect.

The difficulty seems to be the fact that the column includes lists, and possibly also that it is inconsistent (i.e. not lists all of the same length, some NAs, some that are just single words)?

Example data

words_data <- data.table(
  word = c("Lots", "of", "words", "some", "are", "names", 
           "like", "Tom", "and", "Connolly", "or", "Pete", "Dawson"),
  names = c(list(c("Tom", "Connolly")),
            list(c("Tom", "Connolly")),
            list(c("Tom", "Connolly")),
            NA,
            NA,
            NA,
            list(c("Tom", "Connolly")),
            list(c("Tom", "Connolly", "Pete", "Dawson")),
            list(c("Jenny", "Rogers")),
            NA,
            list(c("Pete", "Dawson")),
            "Dawson",
            NA)
)

Desired output

A data.table filtered to rows where the value in the word column can be found in names column.

Therefore the only one that would match in this particular dataset would be the 8th row, which has "Tom" as the word and c("Tom", "Connolly", "Pete", "Dawson") as the names.

Using %in%

This just returns one line, but I don't know why this line.

> words_data[word %in% names]
     word names
1: Dawson    NA

Using unlist()

This does identify that words that are names, so basically suggests that the entire names column is unlisted and all of the words checked against, which seems closer but I only want it to check for the row.

> words_data[word %in% unlist(names)]
       word                    names
1:      Tom Tom,Connolly,Pete,Dawson
2: Connolly                       NA
3:     Pete                   Dawson
4:   Dawson                       NA

Using sapply

I thought using sapply() might help with the row-wise issue but the output is the same as just doing word %in% names.

> words_data[word %in% sapply(names, unlist)]
     word names
1: Dawson    NA
like image 897
Jaccar Avatar asked Mar 04 '23 01:03

Jaccar


1 Answers

This is essentially just a hidden loop, but it will work:

words_data[mapply(`%in%`, word, names)]
#   word                    names
#1:  Tom Tom,Connolly,Pete,Dawson

I thought it might scale terribly, but it is okay:

words_data <- words_data[rep(1:13,1e5),]
nrow(words_data)
#[1] 1300000
system.time(words_data[mapply(`%in%`, word, names)])
#   user  system elapsed 
#  1.329   0.016   1.345 

The issue with most of the attempts in the question is that they are not considering the word and names piece-by-piece in vectorised comparison across multiple vectors. Map or mapply will take care of this:

mapply(paste, 1:3, letters[1:3])
#[1] "1 a" "2 b" "3 c"

The reasons why the other results didn't work are varied. E.g.:

%in%

This compares each value of word in turn to see if it exists in names exactly

words_data$word %in% words_data$names
#[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#[8] FALSE FALSE FALSE FALSE FALSE  TRUE

"Dawson" in row 13 of word matches "Dawson" in row 12 of names. It won't match anything else that is a list containing "Dawson" along with other values though:

"Dawson" %in% list(list("Dawson","Tom"))
#[1] FALSE

unlist

"...basically suggests that the entire names column is unlisted and all of the names word checked against"

Yep, that's it.

sapply + unlist

The sapply here didn't do anything to the names object, because the unlist is only run inside every list item anyway:

identical(words_data$names, sapply(words_data$names, unlist))
#[1] TRUE

Then you can reference the %in% logic above for a reason as to why it didn't work as intended.

like image 181
thelatemail Avatar answered Mar 07 '23 00:03

thelatemail