I have a dataset containing a column of character strings:
text <- c('flight cancelled','dog cat','coach travel','car bus','cow sheep',' high bar')
transport <- 0
df <- data.frame(text,transport)
For each row I want to return 1 if the string in text contains any of several words, or 0 otherwise. My problem is that the only way I can think to do this is using a for loop. Is there a more efficient way of doing this? My dataset is quite large, so the for loop takes forever to run.
words<- 'flight|flights|plane|seats|seat|travel|time|coach'
for (i in 1:6){
  df$transport[i] <- ifelse(any(grepl(words, str_split(as.character(df$text[i]), " "))) == TRUE, 1, 0)
}
returns:
text transport
1 flight cancelled 1
2 dog cat 0
3 coach travel 1
4 car bus 0
5 cow sheep 0
6 high bar 0
You can use words and df$text directly in grep to find the rows which you want to set to 1.
df$transport[grep(words, df$text)] <- 1
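For reference, grep returns the indices of the matching rows, which is what drives the assignment above. On the example data from the question:

```r
words <- 'flight|flights|plane|seats|seat|travel|time|coach'
text <- c('flight cancelled','dog cat','coach travel','car bus','cow sheep',' high bar')

# indices of rows containing any of the words
grep(words, text)
#[1] 1 3
```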
Another way is using grepl and unary + to convert the logical result to 0 and 1:
+grepl(words, df$text)
#[1] 1 0 1 0 0 0
In case only whole words should be matched, they need to be surrounded with \b to match word boundaries.
+grepl(paste0("\\b(", words, ")\\b"), df$text)
#[1] 1 0 1 0 0 0
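Both patterns give the same result on the example data, so here is a quick check of the difference, using timetable as an example string that is not in the original data:

```r
words <- 'flight|flights|plane|seats|seat|travel|time|coach'

# without boundaries, "time" also matches inside "timetable"
grepl(words, "timetable")
#[1] TRUE

# with boundaries, only whole words match
grepl(paste0("\\b(", words, ")\\b"), "timetable")
#[1] FALSE
```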
Benchmark:
bench::mark(
grepl = +grepl(words, df$text)
, "grepl\\b" = +grepl(paste0("\\b(", words, ")\\b"), df$text)
, greplPerl = +grepl(words, df$text, perl = TRUE)
, stringr = +stringr::str_detect(df$text, words)
, stringi = +stringi::stri_detect_regex(df$text, words)
, like = +data.table::like(df$text, words)
)
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
# <bch:expr> <bch:t> <bch:t> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm>
#1 grepl 10.61µs 11.61µs 62577. 0B 6.26 9999 1 159.8ms
#2 grepl\b 15.29µs 16.31µs 59343. 0B 11.9 9998 2 168.5ms
#3 greplPerl 5.5µs 5.9µs 164148. 0B 0 10000 0 60.9ms
#4 stringr 10.01µs 10.78µs 88661. 0B 17.7 9998 2 112.8ms
#5 stringi 7.48µs 7.93µs 123578. 0B 12.4 9999 1 80.9ms
#6 like 11.83µs 12.66µs 77189. 0B 7.72 9999 1 129.5ms
In this case, grepl from base R with perl = TRUE is the fastest method.
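Applied to the example data, assigning the result of the fastest variant back to the column would then be:

```r
text <- c('flight cancelled','dog cat','coach travel','car bus','cow sheep',' high bar')
df <- data.frame(text)
words <- 'flight|flights|plane|seats|seat|travel|time|coach'

df$transport <- +grepl(words, df$text, perl = TRUE)
df$transport
#[1] 1 0 1 0 0 0
```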
If you are looking for speed, stringr or stringi functions usually outperform base functions:
library(stringr)
as.integer(str_detect(df$text, words))
[1] 1 0 1 0 0 0
EDIT: one more note, consider using word boundaries so that you do not get partial matches (e.g., the pattern flight matching inside flights):
paste0("\\b", gsub("|", "\\b|\\b", words, fixed = T), "\\b")
[1] "\\bflight\\b|\\bflights\\b|\\bplane\\b|\\bseats\\b|\\bseat\\b|\\btravel\\b|\\btime\\b|\\bcoach\\b"