R: filtering rows that contain a combination of words

Question

I am working with text data, and looking for a solution to a filtering problem.

I have managed to find a solution that filters for rows that contain 'Word 1' OR 'Word 2'.

Here's the reproducible code:

df=data.frame(UID=c(1,2,3,4,5),Text=c("the quick brown fox jumped over the lazy dog",
                                 "long live the king",
                                 "I love my dog a lot",
                                 "Tomorrow will be a rainy day",
                                 "Tomorrow will be a sunny day"))


#Filter for rows that contain "brown" OR "dog"
filtered_results_1=dplyr::filter(df, grepl('brown|dog', Text))

However, when I filter for rows that contain both 'Word 1' AND 'Word 2', it doesn't work:

#Filter for rows that contain "brown" AND "dog"
filtered_results_2=dplyr::filter(df, grepl('brown & dog', Text))

I cannot figure out the right syntax for this one. Any help would be appreciated.

Moody_Mudskipper · Accepted Answer

You could use stringr::str_count:

dplyr::mutate(df, test = stringr::str_count(Text,'brown|dog'))
#   UID                                         Text test
# 1   1 the quick brown fox jumped over the lazy dog    2
# 2   2                           long live the king    0
# 3   3                          I love my dog a lot    1
# 4   4                 Tomorrow will be a rainy day    0
# 5   5                 Tomorrow will be a sunny day    0

dplyr::filter(df, stringr::str_count(Text,'brown|dog') == 2)
#   UID                                         Text
# 1   1 the quick brown fox jumped over the lazy dog

Note that this counts every occurrence of either word, so a text that contains 'dog' twice and no 'brown' would also get a count of 2.
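To make this caveat concrete, here is a minimal base R sketch. The helper `count_matches` is hypothetical and only stands in for `stringr::str_count` so the example runs without extra packages; the second test string is made up for illustration.

```r
# Hypothetical stand-in for stringr::str_count, using base R only:
# counts the non-overlapping matches of `pattern` in each element of `x`
count_matches <- function(x, pattern) {
  lengths(regmatches(x, gregexpr(pattern, x)))
}

texts <- c("the quick brown fox jumped over the lazy dog",
           "my dog chased another dog")

counts <- count_matches(texts, "brown|dog")
print(counts)
# Both strings reach a count of 2, but only the first one
# contains both "brown" and "dog"
```

Comparing the count to the number of words is therefore only safe when no word can occur more than once in a text.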

The following is more general. It is less elegant than some alternatives, but it lets you conveniently put the searched words in a vector:

dplyr::filter(df, purrr::map_int(strsplit(as.character(Text),'[[:punct:] ]'),
               ~sum(unique(.) %in% c("brown","dog"))) == 2)

#   UID                                         Text
# 1   1 the quick brown fox jumped over the lazy dog

akrun · Answer

We can use two grepl calls combined with &. Note that the backslash must be doubled inside an R string:

dplyr::filter(df, grepl('\\bbrown\\b', Text) & grepl('\\bdog\\b', Text))

Alternatively, use a single pattern that matches the word 'brown' followed by the word 'dog', or 'dog' followed by 'brown'. The word boundary (\b, written "\\b" in an R string) ensures that only whole words match:

dplyr::filter(df, grepl("\\bbrown\\b.*\\bdog\\b|\\bdog\\b.*\\bbrown\\b", Text))
#   UID                                         Text
#1   1 the quick brown fox jumped over the lazy dog

NOTE: The pattern relies on word boundaries around 'brown' and 'dog', and it matches only when both words are present in the string, in either order.

It can also be done with base R:

subset(df, grepl("\\bbrown\\b.*\\bdog\\b|\\bdog\\b.*\\bbrown\\b", Text))
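With more than two words, the alternation has to spell out every possible ordering. One alternative, which is not from the answer above, is to use one lookahead per word with `perl = TRUE`. The sketch below assumes the texts contain no newlines, since `.` does not match a newline by default.

```r
df <- data.frame(UID = 1:5,
                 Text = c("the quick brown fox jumped over the lazy dog",
                          "long live the king",
                          "I love my dog a lot",
                          "Tomorrow will be a rainy day",
                          "Tomorrow will be a sunny day"),
                 stringsAsFactors = FALSE)

# Each (?=...) asserts that the word occurs somewhere after the start,
# so the order of the words in the text does not matter
pattern <- "^(?=.*\\bbrown\\b)(?=.*\\bdog\\b)"

matched <- grepl(pattern, df$Text, perl = TRUE)
print(subset(df, matched))
```

Adding another required word only needs one more `(?=.*\\bword\\b)` group rather than a new set of orderings.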

Tags: text, r, dplyr, filtering

Varun
