I have a data.table. One of the columns is a list of words. I want to see if any of those words appear in another column, which is a single word, for each row. I feel like this should be easy but I am not getting the result I expect.
The difficulty seems to be the fact that the column includes lists, and possibly also that it is inconsistent (i.e. not lists all of the same length, some NAs, some that are just single words)?
Example data
words_data <- data.table(
word = c("Lots", "of", "words", "some", "are", "names",
"like", "Tom", "and", "Connolly", "or", "Pete", "Dawson"),
names = c(list(c("Tom", "Connolly")),
list(c("Tom", "Connolly")),
list(c("Tom", "Connolly")),
NA,
NA,
NA,
list(c("Tom", "Connolly")),
list(c("Tom", "Connolly", "Pete", "Dawson")),
list(c("Jenny", "Rogers")),
NA,
list(c("Pete", "Dawson")),
"Dawson",
NA)
)
Desired output
A data.table filtered to rows where the value in the word
column can be found in names
column.
Therefore the only one that would match in this particular dataset would be the 8th row, which has "Tom"
as the word and c("Tom", "Connolly", "Pete", "Dawson")
as the names.
Using %in%
This just returns one line, but I don't know why this line.
> words_data[word %in% names]
word names
1: Dawson NA
Using unlist()
This does identify that words that are names, so basically suggests that the entire names column is unlisted and all of the words checked against, which seems closer but I only want it to check for the row.
> words_data[word %in% unlist(names)]
word names
1: Tom Tom,Connolly,Pete,Dawson
2: Connolly NA
3: Pete Dawson
4: Dawson NA
Using sapply
I thought using sapply()
might help with the row-wise issue but the output is the same as just doing word %in% names
.
> words_data[word %in% sapply(names, unlist)]
word names
1: Dawson NA
This is essentially just a hidden loop, but it will work:
words_data[mapply(`%in%`, word, names)]
# word names
#1: Tom Tom,Connolly,Pete,Dawson
I thought it might scale terribly, but it is okay:
words_data <- words_data[rep(1:13,1e5),]
nrow(words_data)
#[1] 1300000
system.time(words_data[mapply(`%in%`, word, names)])
# user system elapsed
# 1.329 0.016 1.345
The issue with most of the attempts in the question is that they are not considering the word
and names
piece-by-piece in vectorised comparison across multiple vectors. Map
or mapply
will take care of this:
mapply(paste, 1:3, letters[1:3])
#[1] "1 a" "2 b" "3 c"
The reasons why the other results didn't work are varied. E.g.:
This compares each value of word
in turn to see if it exists in names
exactly
words_data$word %in% words_data$names
#[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#[8] FALSE FALSE FALSE FALSE FALSE TRUE
"Dawson"
in row 13 of word
matches "Dawson"
in row 12 of names
. It won't match anything else that is a list
containing "Dawson"
along with other values though:
"Dawson" %in% list(list("Dawson","Tom"))
#[1] FALSE
"...basically suggests that the entire names
column is unlisted and all of the names word
checked against"
Yep, that's it.
The sapply
here didn't do anything to the names
object, because the unlist
is only run inside every list item anyway:
identical(words_data$names, sapply(words_data$names, unlist))
#[1] TRUE
Then you can reference the %in%
logic above for a reason as to why it didn't work as intended.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With