Filter all rows with word next to a specified word in R

Tags: r, tidyr, tidyverse, tidytext

I have a column with string content

temp <- c(NA, NA, "grocery pantry all offers", NA, "grocery offers today low price", 
"grocery offers today low price", "tide soap", "tide soap bar", 
"tide detergent powders 2kg", NA, "tide", "tide detergent powders 2kg", 
"liquid detergent tide brand")

My goal is to build bigrams from the words adjacent to "tide". To do that, I need to extract the word immediately to the left or right of "tide" in each string. For the example above, the expected output would be:

tide soap
tide soap
tide detergent
tide detergent
detergent tide
tide brand

Any help?


asked Feb 13 '20 12:02

Vaibhav Singh

3 Answers

If you use the quanteda package, this is straightforward. You specify the target word and how many words to keep on the left and right of it.

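library(quanteda)

# Recent quanteda versions expect a tokens object here, so tokenize first.
# window = 1 keeps one word on each side of the keyword.
as.data.frame(kwic(tokens(temp), pattern = "tide", window = 1))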

  docname from to       pre keyword      post pattern
1   text7    1  1              tide      soap    tide
2   text8    1  1              tide      soap    tide
3   text9    1  1              tide detergent    tide
4  text11    1  1              tide              tide
5  text12    1  1              tide detergent    tide
6  text13    3  3 detergent    tide     brand    tide
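The kwic output above still has the neighbours in separate pre and post columns. A minimal base R sketch of turning such a table into the bigrams asked for in the question follows. The data frame kw is a hand-built stand-in that copies the columns printed above (it is an assumption, not the result of actually calling kwic), and the names kw, left, right and bigrams are illustrative.

```r
# Hand-built stand-in for as.data.frame(kwic(...)); values copied from the
# output shown above. This is an assumption, not a real kwic() result.
kw <- data.frame(
  docname = c("text7", "text8", "text9", "text11", "text12", "text13"),
  pre     = c("", "", "", "", "", "detergent"),
  keyword = "tide",
  post    = c("soap", "soap", "detergent", "", "detergent", "brand"),
  stringsAsFactors = FALSE
)

# Build a bigram only where a neighbour exists on that side.
left  <- ifelse(nzchar(kw$pre),  paste(kw$pre, kw$keyword),  NA_character_)
right <- ifelse(nzchar(kw$post), paste(kw$keyword, kw$post), NA_character_)

# rbind() + as.vector() interleaves row by row (left1, right1, left2, ...),
# so each row's left bigram comes before its right bigram.
bigrams <- as.vector(rbind(left, right))
bigrams <- bigrams[!is.na(bigrams)]

print(bigrams)
```

Rows where the keyword has no neighbour, such as the lone "tide" in text11, contribute nothing to the result.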

answered Oct 13 '22 11:10

jazzurro

Is this what you want?

library(stringr)

str_extract(temp, "(tide [:alnum:]*)|([:alnum:]* tide)")

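The pattern extracts either "tide" followed by a space and then a run of letters and digits ([:alnum:]) of any length (*), or (|) the reverse order ([:alnum:]* tide). Note that str_extract returns only the first match in each string, so "liquid detergent tide brand" yields "detergent tide" but not "tide brand", and a string that is just "tide" yields NA because the pattern requires a space.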

Btw: if you want to, afterwards you can remove the NAs with

x <- str_extract(temp, "(tide [:alnum:]*)|([:alnum:]* tide)")
x[!is.na(x)]
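If every neighbour is needed, on both sides of "tide", one option is to collect all matches rather than only the first. The sketch below uses base R's gregexpr and regmatches instead of stringr, adds word boundaries so that words merely containing "tide" are not picked up, and runs one pattern per side. It is one possible variant under those assumptions, not the answer's own method, and the names x, left, right and bigrams are illustrative.

```r
temp <- c(NA, NA, "grocery pantry all offers", NA,
          "grocery offers today low price", "grocery offers today low price",
          "tide soap", "tide soap bar", "tide detergent powders 2kg", NA,
          "tide", "tide detergent powders 2kg", "liquid detergent tide brand")

# Drop missing values up front so the regex functions only see real strings.
x <- temp[!is.na(temp)]

# One pattern per side: "<word> tide" and "tide <word>".
left  <- regmatches(x, gregexpr("[[:alnum:]]+\\s+tide\\b", x, perl = TRUE))
right <- regmatches(x, gregexpr("\\btide\\s+[[:alnum:]]+", x, perl = TRUE))

# Combine per string so the left-hand bigram precedes the right-hand one,
# then flatten; strings without a match contribute nothing.
bigrams <- unlist(Map(c, left, right))

print(bigrams)
```

Matching each side separately sidesteps the overlap problem: in "liquid detergent tide brand" the two bigrams share the word "tide", so a single alternation pattern cannot return both.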

answered Oct 13 '22 11:10

Georgery

You can use the tidytext package to split the text into bigrams and filter for tide.

library(tidytext)
library(dplyr)
library(tibble)
library(stringr)  # provides str_detect()

temp %>% 
  enframe(name = "id") %>%
  filter(str_detect(value, "tide")) %>%
  unnest_tokens(bigrams, value, token = "ngrams", n = 2) %>%
  filter(str_detect(bigrams, "tide"))

# A tibble: 6 x 2
     id bigrams       
  <int> <chr>         
1     5 tide soap     
2     6 tide soap     
3     7 tide detergent
4    10 tide detergent
5    11 detergent tide
6    11 tide brand
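To make the tokenize-then-filter idea concrete without any packages, here is a rough base R sketch. The helper bigrams_with is hypothetical (it is not part of tidytext), and it compares whole tokens to the target rather than using a substring match, which is a deliberate difference from str_detect.

```r
# Hypothetical helper: return all bigrams in which `target` is one of the
# two words. Tokenization is a simple split on non-alphanumeric characters.
bigrams_with <- function(text, target) {
  text <- text[!is.na(text)]
  tokens <- strsplit(tolower(text), "[^[:alnum:]]+")
  out <- lapply(tokens, function(w) {
    w <- w[nzchar(w)]                      # drop empty tokens from leading separators
    if (length(w) < 2) return(character(0))
    first  <- w[-length(w)]                # words 1 .. n-1
    second <- w[-1]                        # words 2 .. n
    keep <- first == target | second == target
    paste(first[keep], second[keep])
  })
  unlist(out)
}

temp <- c(NA, NA, "grocery pantry all offers", NA,
          "grocery offers today low price", "grocery offers today low price",
          "tide soap", "tide soap bar", "tide detergent powders 2kg", NA,
          "tide", "tide detergent powders 2kg", "liquid detergent tide brand")

result <- bigrams_with(temp, "tide")
print(result)
```

Because the comparison is on whole tokens, a word that only contains the target as a substring does not qualify, and a string consisting of the single word "tide" produces no bigram.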

answered Oct 13 '22 11:10

Ritchie Sacramento
