I have the following data.
stringstosearch <- c("to", "and", "at", "from", "is", "of")
set.seed(199)
datatxt <- data.frame(id = c(rnorm(5)),
                      x = c("Contrary to popular belief, Lorem Ipsum is not simply random text.",
                            "A Latin professor at Hampden-Sydney College in Virginia",
                            "It has roots in a piece of classical Latin ",
                            "literature from 45 BC, making it over 2000 years old.",
                            "The standard chunk of Lorem Ipsum used since"))
I want to search for each keyword listed in stringstosearch and create one column per keyword holding the result.
I tried
library(stringr)
datatxt$result <- str_detect(datatxt$x, paste0(stringstosearch, collapse = '|'))
which returns
> datatxt$result
[1] TRUE TRUE TRUE TRUE TRUE
However, I am looking for an approach which creates a boolean vector for each word in stringstosearch, i.e.
id x to and at from is of
1 -1.9091427 Contrary to popular belief, Lorem Ipsum is not simply random text. TRUE FALSE FALSE FALSE TRUE FALSE
2 0.5551667 A Latin professor at Hampden-Sydney College in Virginia FALSE FALSE TRUE FALSE FALSE FALSE
3 -2.2163365 It has roots in a piece of classical Latin FALSE FALSE FALSE FALSE FALSE TRUE
4 0.4941455 literature from 45 BC, making it over 2000 years old. FALSE FALSE FALSE TRUE FALSE FALSE
5 -0.5805710 The standard chunk of Lorem Ipsum used since FALSE FALSE FALSE FALSE FALSE TRUE
Any idea how to achieve this?
Here is a base R one-liner. Use sprintf() to add the \\b word boundary anchors to each pattern. This means that, for example, "and" will not match "random". Then iterate over these patterns with lapply(), using grepl() to match each pattern to datatxt$x. This returns a list of logical vectors, which we can assign back to the data frame.
datatxt[stringstosearch] <- lapply(
  sprintf("\\b%s\\b", stringstosearch), \(x) grepl(x, datatxt$x)
)
Now datatxt is as desired:
id x to and at from is of
1 -1.9091427 Contrary to popular belief, Lorem Ipsum is not simply random text. TRUE FALSE FALSE FALSE TRUE FALSE
2 0.5551667 A Latin professor at Hampden-Sydney College in Virginia FALSE FALSE TRUE FALSE FALSE FALSE
3 -2.2163365 It has roots in a piece of classical Latin FALSE FALSE FALSE FALSE FALSE TRUE
4 0.4941455 literature from 45 BC, making it over 2000 years old. FALSE FALSE FALSE TRUE FALSE FALSE
5 -0.5805710 The standard chunk of Lorem Ipsum used since FALSE FALSE FALSE FALSE FALSE TRUE
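To see why the word boundary anchors matter, here is a minimal sketch on two made-up strings (the `texts` vector is hypothetical, not the question's data). It compares a plain substring match with the anchored pattern:

```r
# Hypothetical toy inputs, chosen so that "and" appears once inside
# another word and once as a standalone word.
texts <- c("not simply random text", "salt and pepper")

# Plain pattern: matches "and" anywhere, including inside "random".
plain <- grepl("and", texts)

# Anchored pattern: \\b requires a word boundary on both sides.
anchored <- grepl("\\band\\b", texts)

print(plain)     # TRUE TRUE
print(anchored)  # FALSE TRUE
```

Without the anchors, the first string would be flagged for "and" even though it only contains "random".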
tidyverse approach
As you tagged tidyverse, here is an alternative method. It builds the same list as the base R approach using tidyverse functions, except that the list is named. We can then use the splice operator (!!!) to pass it to dplyr::mutate() as new columns:
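datatxt |>
  dplyr::mutate(
    !!!purrr::map(
      purrr::set_names(
        stringr::str_glue("\\b{stringstosearch}\\b"),
        stringstosearch
      ),
      # !!! evaluates its operand before mutate() sets up the data mask,
      # so refer to the column as datatxt$x rather than a bare x
      \(str) stringr::str_detect(datatxt$x, str)
    )
  )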
# ^^ same output
I think the base R approach is much cleaner.
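For readers unfamiliar with the splice operator, a minimal sketch of what `!!!` does inside dplyr::mutate() may help. The data frame `toy` and the list `new_cols` are hypothetical names used only for illustration, and the example assumes dplyr is installed:

```r
library(dplyr)

# Hypothetical toy data frame and a named list of precomputed columns.
toy <- data.frame(a = 1:3)
new_cols <- list(double_a = toy$a * 2, above_one = toy$a > 1)

# !!! unpacks the list so that each element becomes its own named
# argument, as if we had written mutate(toy, double_a = ..., above_one = ...).
result <- mutate(toy, !!!new_cols)

print(names(result))
```

The list is built first and only then spliced in, which is why the list names end up as the new column names.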
I suggest applying Vectorize() to the pattern argument of stringfish::sf_grepl():
Vsf_grepl = Vectorize(\(pattern) stringfish::sf_grepl(datatxt$x, pattern))
datatxt[stringstosearch] = Vsf_grepl(sprintf("\\b%s\\b", stringstosearch))
gives
> datatxt
id x to and at from is of
1 1 Contrary to popular belief, Lorem Ipsum is not simply random text. TRUE FALSE FALSE FALSE TRUE FALSE
2 2 A Latin professor at Hampden-Sydney College in Virginia FALSE FALSE TRUE FALSE FALSE FALSE
3 3 It has roots in a piece of classical Latin FALSE FALSE FALSE FALSE FALSE TRUE
4 4 literature from 45 BC, making it over 2000 years old. FALSE FALSE FALSE TRUE FALSE FALSE
5 5 The standard chunk of Lorem Ipsum used since FALSE FALSE FALSE FALSE FALSE TRUE
Note, I changed id generation to id = 1:5.
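As a rough illustration of what Vectorize() contributes here, the sketch below uses base grepl() in place of stringfish::sf_grepl() (a deliberate substitution so that it runs without extra packages) on a hypothetical `texts` vector. The vectorized function returns a logical matrix with one column per pattern:

```r
# Hypothetical toy inputs; grepl() stands in for stringfish::sf_grepl().
texts <- c("a cat", "the dog")

# Vectorize() wraps the function in mapply(), so it is called once per pattern.
vgrepl <- Vectorize(function(pattern) grepl(pattern, texts))

# One row per element of texts, one column per pattern;
# the columns are named after the patterns.
m <- vgrepl(c("cat", "dog"))

print(m)
```

Assigning such a matrix to several data frame columns at once, as in the answer above, stores each matrix column as its own data frame column.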