
Find matches of a vector of strings in another vector of strings

I'm trying to create a subset of a data frame of news articles that mention at least one element of a set of keywords or phrases.

# Sample data frame of articles
articles <- data.frame(id=c(1, 2, 3, 4), text=c("Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod", "tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,", "quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo", "consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse"))
articles$text <- as.character(articles$text)

# Sample vector of keywords or phrases
keywords <- as.character(c("elit", "tempor incididunt", "reprehenderit"))

#   id                                                                         text
# 1  1     Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
# 2  2 tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,
# 3  3      quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo
# 4  4    consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse

Given the vector of keywords, the subset should contain rows 1, 2, and 4, since those rows contain one or more of the elements of the vector.

Neither %in% nor grep() works. %in% tests whether each whole string equals one of the keywords, not whether it contains one (articles$text %in% keywords returns four FALSEs), and grep() doesn't seem to handle a vector of patterns (grep(keywords, articles$text) warns and uses only the first keyword). It would be easy to search all the rows for one keyword, but not for all three at the same time.

What's the best way to find and select all rows of the data frame that contain at least one of the elements of the keyword vector?

Andrew asked Jun 16 '13 04:06



1 Answer

You can paste your keywords together into a single regular expression, separated by the pipe character (|), which acts as an "or". grepl() then returns TRUE for every row that matches at least one keyword:

> articles[grepl(paste(keywords, collapse="|"), articles$text),]
  id                                                                         text
1  1     Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
2  2 tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,
4  4    consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse
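
One caveat with the pipe approach: each keyword is interpreted as a regular expression, so a keyword containing characters such as "." or "(" may match more than intended (or fail to compile). A possible alternative, sketched here with the sample data from the question, is to match each keyword literally with fixed = TRUE and combine the results with a logical OR. The names keep and matched are only illustrative.

```r
# Sample data from the question
articles <- data.frame(
  id = c(1, 2, 3, 4),
  text = c(
    "Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod",
    "tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,",
    "quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo",
    "consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse"
  ),
  stringsAsFactors = FALSE
)
keywords <- c("elit", "tempor incididunt", "reprehenderit")

# Run grepl() once per keyword; fixed = TRUE treats the keyword as a
# literal string rather than a regular expression.
per_keyword <- lapply(keywords, grepl, x = articles$text, fixed = TRUE)

# Combine the per-keyword logical vectors element-wise with OR, so a row
# is kept if it contains at least one keyword.
keep <- Reduce(`|`, per_keyword)

matched <- articles[keep, ]
print(matched$id)
```

With the sample data this selects the same rows (1, 2, and 4) as the pipe version. Note that both versions match substrings, so "elit" would also match inside a longer word such as "elite".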
A5C1D2H2I1M1N2O1R2T1 answered Oct 06 '22 00:10
