I am working with text data, and looking for a solution to a filtering problem.
I have managed to find a solution that filters for rows that contain 'Word 1' OR 'Word 2'.
Here's the reproducible code:
df=data.frame(UID=c(1,2,3,4,5),Text=c("the quick brown fox jumped over the lazy dog",
"long live the king",
"I love my dog a lot",
"Tomorrow will be a rainy day",
"Tomorrow will be a sunny day"))
#Filter for rows that contain "brown" OR "dog"
filtered_results_1=dplyr::filter(df, grepl('brown|dog', Text))
However, when I try to filter for rows that contain both 'Word 1' AND 'Word 2', it doesn't work:
#Filter for rows that contain "brown" AND "dog"
filtered_results_2=dplyr::filter(df, grepl('brown & dog', Text))
I cannot figure out the right syntax for this one; any help would be appreciated.
You could use stringr::str_count:
dplyr::mutate(df, test = stringr::str_count(Text,'brown|dog'))
#   UID                                         Text test
# 1   1 the quick brown fox jumped over the lazy dog    2
# 2   2                           long live the king    0
# 3   3                          I love my dog a lot    1
# 4   4                 Tomorrow will be a rainy day    0
# 5   5                 Tomorrow will be a sunny day    0
dplyr::filter(df, stringr::str_count(Text,'brown|dog') == 2)
# UID Text
# 1 1 the quick brown fox jumped over the lazy dog
It will count dog or brown as many times as they occur, though, so a row with two occurrences of one word and none of the other would also pass this filter.
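To make that caveat concrete, here is a small sketch. The second string is made up for illustration and is not part of the question's data; detecting each word separately with stringr::str_detect is one possible way around the over-counting, not the only one.

```r
# Hypothetical example strings (the second one is not in the question's data)
texts <- c("the quick brown fox jumped over the lazy dog",
           "my dog chased another dog")

# Counting matches of the alternation cannot tell the two cases apart:
# both strings contain exactly two matches of 'brown|dog'
counts <- stringr::str_count(texts, "brown|dog")

# Detecting each word separately and combining with & only keeps
# strings that contain both words
has_both <- stringr::str_detect(texts, "brown") &
  stringr::str_detect(texts, "dog")
```

Here counts is 2 for both strings, while has_both is TRUE only for the first one.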
The following is more general and less elegant than some alternatives, but it lets you conveniently put the searched words in a vector:
dplyr::filter(df, purrr::map_int(strsplit(as.character(Text),'[[:punct:] ]'),
~sum(unique(.) %in% c("brown","dog"))) == 2)
# UID Text
# 1 1 the quick brown fox jumped over the lazy dog
We can use a double grepl:
dplyr::filter(df, grepl('\\bbrown\\b', Text) & grepl('\\bdog\\b', Text))
or use a pattern that matches the word 'brown' followed by the word 'dog', or 'dog' followed by 'brown' (note the word boundary, \\b, which makes sure that only whole words match):
dplyr::filter(df, grepl("\\bbrown\\b.*\\bdog\\b|\\bdog\\b.*\\bbrown\\b", Text))
# UID Text
#1 1 the quick brown fox jumped over the lazy dog
NOTE: The pattern requires both 'brown' and 'dog' to be present in the string as whole words, in either order.
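The alternation above has to spell out every ordering of the words, which gets unwieldy beyond two words. As a possible alternative (a sketch that assumes Perl-compatible regular expressions are acceptable, enabled with perl = TRUE), one lookahead per word expresses the same AND condition. The names pattern and filtered are only illustrative.

```r
df <- data.frame(UID = 1:5,
                 Text = c("the quick brown fox jumped over the lazy dog",
                          "long live the king",
                          "I love my dog a lot",
                          "Tomorrow will be a rainy day",
                          "Tomorrow will be a sunny day"),
                 stringsAsFactors = FALSE)

# Each (?=...) lookahead asserts that its word occurs somewhere in the
# string without consuming characters, so the order of the words is irrelevant
pattern <- "^(?=.*\\bbrown\\b)(?=.*\\bdog\\b)"

# Lookaheads require the PCRE engine, hence perl = TRUE
filtered <- df[grepl(pattern, df$Text, perl = TRUE), ]
```

Adding a third word only means appending another lookahead instead of listing all six orderings.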
It can also be done with base R:
subset(df, grepl("\\bbrown\\b.*\\bdog\\b|\\bdog\\b.*\\bbrown\\b", Text))
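If the words to search for live in a vector, the double grepl idea can be generalised in base R. The following is a minimal sketch under that assumption; words, patterns and keep are illustrative names, not part of the answers above.

```r
df <- data.frame(UID = 1:5,
                 Text = c("the quick brown fox jumped over the lazy dog",
                          "long live the king",
                          "I love my dog a lot",
                          "Tomorrow will be a rainy day",
                          "Tomorrow will be a sunny day"),
                 stringsAsFactors = FALSE)

# Words that must all be present (any number of words works)
words <- c("brown", "dog")

# Wrap each word in word boundaries so that only whole words match
patterns <- paste0("\\b", words, "\\b")

# One logical vector per word, combined element-wise with &
keep <- Reduce(`&`, lapply(patterns, grepl, x = df$Text))

result <- df[keep, ]
```

With the example data, result contains only the row with UID 1. Note that the words are used as regular expressions here, so words containing special characters would need escaping first.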