I have a column with string content:
temp <- c(NA, NA, "grocery pantry all offers", NA, "grocery offers today low price",
"grocery offers today low price", "tide soap", "tide soap bar",
"tide detergent powders 2kg", NA, "tide", "tide detergent powders 2kg",
"liquid detergent tide brand")
My intention is to create bigrams from the words that are next to "tide". For this I need to extract the word immediately to the left or right of "tide". For example, for the data above the output would be:
tide soap
tide soap
tide detergent
tide detergent
detergent tide
tide brand
Any help?
If you use the quanteda package, this is straightforward. You specify which word you want to target and decide how many words on left/right side of the target you want.
library(quanteda)
# Recent quanteda versions expect a tokens object here, not a raw character vector
as.data.frame(kwic(tokens(temp), pattern = "tide", window = 1))
  docname from to       pre keyword      post pattern
1   text7    1  1              tide      soap    tide
2   text8    1  1              tide      soap    tide
3   text9    1  1              tide detergent    tide
4  text11    1  1              tide              tide
5  text12    1  1              tide detergent    tide
6  text13    3  3 detergent    tide     brand    tide
Is this what you want?
library(stringr)
str_extract(temp, "(tide [:alnum:]*)|([:alnum:]* tide)")
It basically says: extract the strings that are either "tide" followed by a whitespace and then a combination of letters and numbers ([:alnum:]) of any length (*), or (|) the other way around ([:alnum:]* tide).
By the way, if you want to, you can remove the NAs afterwards with:
x <- str_extract(temp, "(tide [:alnum:]*)|([:alnum:]* tide)")
x[!is.na(x)]
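Note that str_extract() returns only the first match per string, so for "liquid detergent tide brand" it gives "detergent tide" and drops "tide brand". If both neighbours are needed, one option is to split each string into words and pair up adjacent ones in base R. The following is only a sketch: it assumes words are separated by whitespace and that an exact, case-sensitive match on the target word is wanted, and tide_bigrams is a hypothetical helper name, not a function from any package.

```r
temp <- c(NA, NA, "grocery pantry all offers", NA, "grocery offers today low price",
          "grocery offers today low price", "tide soap", "tide soap bar",
          "tide detergent powders 2kg", NA, "tide", "tide detergent powders 2kg",
          "liquid detergent tide brand")

# Hypothetical helper: return every pair of adjacent words in which
# `target` is one of the two words.
tide_bigrams <- function(x, target = "tide") {
  x <- x[!is.na(x)]                         # drop missing entries
  out <- lapply(strsplit(x, "\\s+"), function(words) {
    n <- length(words)
    if (n < 2) return(character(0))         # a single word has no neighbour
    left  <- words[-n]                      # first word of each adjacent pair
    right <- words[-1]                      # second word of each adjacent pair
    paste(left, right)[left == target | right == target]
  })
  unlist(out)
}

result <- tide_bigrams(temp)
print(result)
```

On the sample data this should give the six bigrams listed in the question, including both "detergent tide" and "tide brand" for the last string.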
You can use the tidytext package to split the text into bigrams and filter for tide.
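library(tidytext)
library(dplyr)
library(tibble)
library(stringr)  # provides str_detect()
temp %>%
  enframe(name = "id") %>%
  # keep only the strings that mention "tide" at all
  filter(str_detect(value, "tide")) %>%
  unnest_tokens(bigrams, value, token = "ngrams", n = 2) %>%
  # keep only the bigrams that contain "tide"
  filter(str_detect(bigrams, "tide"))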
# A tibble: 6 x 2
     id bigrams
  <int> <chr>
1     5 tide soap
2     6 tide soap
3     7 tide detergent
4    10 tide detergent
5    11 detergent tide
6    11 tide brand