Filter all rows with word next to a specified word in R

I have a column of strings:

temp <- c(NA, NA, "grocery pantry all offers", NA, "grocery offers today low price", 
"grocery offers today low price", "tide soap", "tide soap bar", 
"tide detergent powders 2kg", NA, "tide", "tide detergent powders 2kg", 
"liquid detergent tide brand")

I want to create bigrams from the words that are next to "tide". To do that, I need to extract the word immediately to the left or right of "tide" in each string. For the example above, the output would be:

tide soap
tide soap
tide detergent
tide detergent
detergent tide
tide brand

Any help?

Vaibhav Singh asked Feb 13 '20 12:02



3 Answers

If you use the quanteda package, this is straightforward. You specify the target word and how many words to the left and right of it you want.

library(quanteda)

# kwic() works on a tokens object; recent quanteda releases no longer
# accept a character vector directly, so tokenize first
as.data.frame(kwic(tokens(temp), pattern = "tide", window = 1))

  docname from to       pre keyword      post pattern
1   text7    1  1              tide      soap    tide
2   text8    1  1              tide      soap    tide
3   text9    1  1              tide detergent    tide
4  text11    1  1              tide              tide
5  text12    1  1              tide detergent    tide
6  text13    3  3 detergent    tide     brand    tide
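
The kwic() result still has to be turned into bigrams. A minimal base R sketch of that step, assuming the output above is available as a data frame: here `kw` is a hand-copied stand-in for it (only the pre, keyword and post columns), and the names `kw`, `left`, `right` and `bigrams` are illustrative, not part of quanteda.

```r
# Stand-in for the kwic() output shown above (values copied by hand)
kw <- data.frame(
  pre     = c("", "", "", "", "", "detergent"),
  keyword = "tide",
  post    = c("soap", "soap", "detergent", "", "detergent", "brand"),
  stringsAsFactors = FALSE
)

# Build a bigram on each side of the keyword, or NA when that side is empty
left  <- ifelse(kw$pre  != "", paste(kw$pre, kw$keyword),  NA_character_)
right <- ifelse(kw$post != "", paste(kw$keyword, kw$post), NA_character_)

# Interleave left/right per row so the original order is kept, then drop NAs
bigrams <- as.vector(rbind(left, right))
bigrams <- bigrams[!is.na(bigrams)]
bigrams
```

Rows where "tide" stands alone (text11 here) contribute nothing, because both sides are empty.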
jazzurro answered Oct 13 '22 11:10


Is this what you want?

library(stringr)

str_extract(temp, "(tide [:alnum:]*)|([:alnum:]* tide)")

It basically says: extract either "tide" followed by a space and then any number (*) of letters and digits ([:alnum:]), or (|) the other way around ([:alnum:]* tide). Note that str_extract() returns only the first match per string, so for "liquid detergent tide brand" it gives "detergent tide" and misses "tide brand".

If you want, you can remove the NAs afterwards:

x <- str_extract(temp, "(tide [:alnum:]*)|([:alnum:]* tide)")
x[!is.na(x)]
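
If you need every bigram that contains the target word, including both neighbours when "tide" sits in the middle of a string, one option is to split each string into words and pair adjacent words. This is a minimal base R sketch, not part of stringr; `neighbour_bigrams` is a hypothetical helper name, and it assumes words are separated by whitespace and that the target should match a whole word exactly.

```r
temp <- c(NA, NA, "grocery pantry all offers", NA, "grocery offers today low price",
          "grocery offers today low price", "tide soap", "tide soap bar",
          "tide detergent powders 2kg", NA, "tide", "tide detergent powders 2kg",
          "liquid detergent tide brand")

# Hypothetical helper: return every adjacent word pair in `x` that contains `target`
neighbour_bigrams <- function(x, target = "tide") {
  words <- strsplit(x[!is.na(x)], "\\s+")   # drop NAs, split on whitespace
  out <- lapply(words, function(w) {
    n <- length(w)
    if (n < 2) return(character(0))         # a single word has no neighbour
    left  <- w[-n]                          # first word of each pair
    right <- w[-1]                          # second word of each pair
    keep  <- left == target | right == target
    paste(left[keep], right[keep])
  })
  unlist(out)
}

neighbour_bigrams(temp)
```

Unlike the regex, this matches "tide" only as a complete word, so a string such as "riptide soap" would not produce a bigram.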
Georgery answered Oct 13 '22 11:10


You can use the tidytext package to split the text into bigrams and filter for tide.

library(tidytext)
library(dplyr)
library(tibble)
library(stringr)  # provides str_detect()

temp %>% 
  enframe(name = "id") %>%
  filter(str_detect(value, "tide")) %>%
  unnest_tokens(bigrams, value, token = "ngrams", n = 2) %>%
  filter(str_detect(bigrams, "tide"))

# A tibble: 6 x 2
     id bigrams       
  <int> <chr>         
1     5 tide soap     
2     6 tide soap     
3     7 tide detergent
4    10 tide detergent
5    11 detergent tide
6    11 tide brand  
Ritchie Sacramento answered Oct 13 '22 11:10