
R: return TRUE or FALSE per row if a string contains any of a list of words

I have a dataset containing a column of character strings:

text <- c('flight cancelled','dog cat','coach travel','car bus','cow sheep',' high bar')
transport <- 0

df <- data.frame(text, transport)

For each row I want to return 1 if the string in text contains any of several words, or 0 otherwise. The only way I can think of to do this is with a for loop. Is there a more efficient way? My dataset is quite large, so the for loop takes forever to run.

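library(stringr)  # for str_split()

words <- 'flight|flights|plane|seats|seat|travel|time|coach'

# Loop over every row and flag those containing any of the words
for (i in seq_len(nrow(df))) {
  df$transport[i] <- ifelse(any(grepl(words, str_split(as.character(df$text[i]), " ")[[1]])), 1, 0)
}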

returns:

              text transport
1 flight cancelled         1
2          dog cat         0
3     coach travel         1
4          car bus         0
5        cow sheep         0
6         high bar         0
nogbad asked Jul 17 '19 10:07



2 Answers

You can pass words and df$text directly to grep to find the rows that should be set to 1. This works because transport is already initialised to 0, so only the matching rows need to change.

df$transport[grep(words, df$text)] <- 1

Another way is to use grepl and apply a unary + to convert the logical result to 0 and 1:

+grepl(words, df$text)
#[1] 1 0 1 0 0 0
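
Because grepl is vectorised over its second argument, the whole column is tested in one call and no loop is needed. As a minimal sketch, assuming the words are kept as a character vector (the names word_list and hits are hypothetical, not from the question), the pattern can be built with paste and the logical result converted to 0/1:

```r
# Rebuild the example data from the question
text <- c('flight cancelled', 'dog cat', 'coach travel', 'car bus', 'cow sheep', ' high bar')
df <- data.frame(text, transport = 0)

# Hypothetical word vector; collapsing with "|" gives the same pattern as `words`
word_list <- c('flight', 'flights', 'plane', 'seats', 'seat', 'travel', 'time', 'coach')
words <- paste(word_list, collapse = '|')

# grepl() returns one TRUE/FALSE per row; unary + (or as.integer) turns it into 1/0
hits <- grepl(words, df$text)
df$transport <- +hits
```

Keeping the words in a vector makes the list easier to edit than a hand-written string with | separators.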

If only whole words should be matched, the alternatives need to be surrounded with \b so that they match only at word boundaries.

+grepl(paste0("\\b(", words, ")\\b"), df$text)
#[1] 1 0 1 0 0 0
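
The sample data gives the same result with and without boundaries, so the difference is easy to miss. A small sketch with made-up strings (not from the question; the names x, partial and whole are assumptions) shows where the two patterns diverge:

```r
words <- 'flight|flights|plane|seats|seat|travel|time|coach'

# Made-up strings: the first two contain a listed word only as part of a longer word
x <- c('bus timetable', 'stagecoaches', 'travel time')

# Without boundaries, 'time' matches inside 'timetable' and 'coach' inside 'stagecoaches'
partial <- +grepl(words, x)
partial
#[1] 1 1 1

# With \\b on both sides of the group, only whole words match
whole <- +grepl(paste0("\\b(", words, ")\\b"), x)
whole
#[1] 0 0 1
```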

Benchmark:

bench::mark(
         grepl = +grepl(words, df$text)
       , "grepl\\b" = +grepl(paste0("\\b(", words, ")\\b"), df$text)
       , greplPerl = +grepl(words, df$text, perl = TRUE)
       , stringr = +stringr::str_detect(df$text, words)
       , stringi = +stringi::stri_detect_regex(df$text, words)
       , like = +data.table::like(df$text, words)
       )
#  expression     min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
#  <bch:expr> <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
#1 grepl      10.61µs 11.61µs    62577.        0B     6.26  9999     1    159.8ms
#2 grepl\b    15.29µs 16.31µs    59343.        0B    11.9   9998     2    168.5ms
#3 greplPerl    5.5µs   5.9µs   164148.        0B     0    10000     0     60.9ms
#4 stringr    10.01µs 10.78µs    88661.        0B    17.7   9998     2    112.8ms
#5 stringi     7.48µs  7.93µs   123578.        0B    12.4   9999     1     80.9ms
#6 like       11.83µs 12.66µs    77189.        0B     7.72  9999     1    129.5ms

In this benchmark, base grepl with perl = TRUE is the fastest method.

GKi answered Oct 01 '22 02:10



If you are looking for speed, stringr or stringi functions usually outperform base functions:

library(stringr)

as.integer(str_detect(df$text, words))
[1] 1 0 1 0 0 0

EDIT: one more note: consider using word boundaries so that you do not get partial matches (e.g., flight matching inside the word flights).

paste0("\\b", gsub("|", "\\b|\\b", words, fixed = TRUE), "\\b")
[1] "\\bflight\\b|\\bflights\\b|\\bplane\\b|\\bseats\\b|\\bseat\\b|\\btravel\\b|\\btime\\b|\\bcoach\\b"
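
The pattern built above still has to be passed to the matching function. A minimal sketch, using base grepl so that it runs without extra packages (str_detect(df$text, pattern) would be the stringr counterpart); the variable name pattern is an assumption, not part of the answer:

```r
text <- c('flight cancelled', 'dog cat', 'coach travel', 'car bus', 'cow sheep', ' high bar')
words <- 'flight|flights|plane|seats|seat|travel|time|coach'

# Wrap every alternative in \\b ... \\b; fixed = TRUE treats "|" as a literal character
pattern <- paste0("\\b", gsub("|", "\\b|\\b", words, fixed = TRUE), "\\b")

# One 0/1 value per element of text
result <- as.integer(grepl(pattern, text))
```

For the sample data, result is the same as without boundaries, since every match there is already a whole word.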
Andrew answered Oct 01 '22 00:10
