 

Filtering rows in R that contain a combination of words

I am working with text data, and looking for a solution to a filtering problem.

I have managed to find a solution that filters for rows containing 'Word 1' OR 'Word 2'.

Here is the reproducible code:

df=data.frame(UID=c(1,2,3,4,5),Text=c("the quick brown fox jumped over the lazy dog",
                                 "long live the king",
                                 "I love my dog a lot",
                                 "Tomorrow will be a rainy day",
                                 "Tomorrow will be a sunny day"))


#Filter for rows that contain "brown" OR "dog"
filtered_results_1=dplyr::filter(df, grepl('brown|dog', Text))

However, when I try to filter for rows that contain both 'Word 1' AND 'Word 2', it doesn't work:

#Filter for rows that contain "brown" AND "dog"
filtered_results_2=dplyr::filter(df, grepl('brown & dog', Text))

I cannot figure out the right syntax for this one; any help would be appreciated.

Varun asked Aug 31 '18 14:08



2 Answers

You could use stringr::str_count:

dplyr::mutate(df, test = stringr::str_count(Text,'brown|dog'))
#   UID                                         Text test
# 1   1 the quick brown fox jumped over the lazy dog    2
# 2   2                           long live the king    0
# 3   3                          I love my dog a lot    1
# 4   4                 Tomorrow will be a rainy day    0
# 5   5                 Tomorrow will be a sunny day    0

dplyr::filter(df, stringr::str_count(Text,'brown|dog') == 2)
#   UID                                         Text
# 1   1 the quick brown fox jumped over the lazy dog

Note, however, that it counts every occurrence of 'dog' or 'brown', so a text that contains 'dog' twice and no 'brown' would also give a count of 2.
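To make that caveat concrete, here is a minimal base-R sketch. The sample strings are made up for illustration (they are not from the question), and the count is done with gregexpr rather than stringr so that it runs without extra packages.

```r
# Hypothetical sample strings, not taken from the question's data
x <- c("the quick brown fox jumped over the lazy dog",
       "my dog chased the other dog",
       "a brown dog and another brown dog")

# Count the matches of 'brown|dog' in each string
# (gregexpr returns -1 when there is no match, hence the `> 0` check)
m <- gregexpr("brown|dog", x)
n_matches <- vapply(m, function(hit) sum(hit > 0), integer(1))

# Reference answer: does each string contain both words at least once?
contains_both <- grepl("brown", x) & grepl("dog", x)

# Side-by-side comparison of the count-based test and the reference
data.frame(n_matches, count_is_2 = n_matches == 2, contains_both)
```

In this sketch the second string passes the `== 2` test without containing 'brown', and the third string fails it even though it contains both words.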

The following is more general and less elegant than some alternatives, but it lets you conveniently put the searched words in a vector:

dplyr::filter(df, purrr::map_int(strsplit(as.character(Text),'[[:punct:] ]'),
               ~sum(unique(.) %in% c("brown","dog"))) == 2)

#   UID                                         Text
# 1   1 the quick brown fox jumped over the lazy dog
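If the goal is simply "keep rows that contain every word in a vector", one possible base-R sketch is shown here. The helper name contains_all is hypothetical (it does not come from any package), and it assumes the words contain no regex metacharacters, since each word is pasted directly into a pattern.

```r
# Hypothetical helper: TRUE where `text` contains every element of `words`
# as a whole word. Assumes `words` is non-empty and free of regex
# metacharacters.
contains_all <- function(text, words) {
  # One logical vector per word; \\b restricts each match to a whole word
  hits <- lapply(words, function(w) grepl(paste0("\\b", w, "\\b"), text))
  # Combine the per-word results with an elementwise AND
  Reduce(`&`, hits)
}

df <- data.frame(UID = 1:5,
                 Text = c("the quick brown fox jumped over the lazy dog",
                          "long live the king",
                          "I love my dog a lot",
                          "Tomorrow will be a rainy day",
                          "Tomorrow will be a sunny day"),
                 stringsAsFactors = FALSE)

# Keep only the rows whose Text contains both "brown" and "dog"
result <- df[contains_all(df$Text, c("brown", "dog")), ]
```

The logical vector returned by the helper can also be passed to dplyr::filter in place of the map_int expression.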
Moody_Mudskipper answered Oct 03 '22 22:10



We can use two grepl calls combined with &:

dplyr::filter(df, grepl('\\bbrown\\b', Text) & grepl('\\bdog\\b', Text))

or use a single pattern that matches the word 'brown' followed by the word 'dog', or 'dog' followed by 'brown' (note the word boundary, \\b, which ensures that only whole words are matched):

dplyr::filter(df, grepl("\\bbrown\\b.*\\bdog\\b|\\bdog\\b.*\\bbrown\\b", Text))
#   UID                                         Text
# 1   1 the quick brown fox jumped over the lazy dog

NOTE: The pattern checks for word boundaries around 'brown' and 'dog' and requires both words to be present in the string.


It can also be done with base R:

subset(df, grepl("\\bbrown\\b.*\\bdog\\b|\\bdog\\b.*\\bbrown\\b", Text))
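As a variation on the alternation pattern above (not part of the original answer), the same AND condition can be sketched with lookaheads, which avoids listing every ordering of the words. This relies on grepl's perl = TRUE argument, because lookaheads are not supported by the default regex engine.

```r
df <- data.frame(UID = 1:5,
                 Text = c("the quick brown fox jumped over the lazy dog",
                          "long live the king",
                          "I love my dog a lot",
                          "Tomorrow will be a rainy day",
                          "Tomorrow will be a sunny day"),
                 stringsAsFactors = FALSE)

# Each (?=...) is a lookahead anchored at the start of the string:
# it asserts that the whole word appears somewhere, in any order
pattern <- "^(?=.*\\bbrown\\b)(?=.*\\bdog\\b)"

res <- subset(df, grepl(pattern, Text, perl = TRUE))
```

Adding a third required word only needs one more lookahead group, whereas the alternation form would need every permutation spelled out.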
akrun answered Oct 03 '22 21:10
