
Search multiple keywords over a column and create columns for each

Tags: r, stringr

I have the following data.

stringstosearch <- c("to", "and", "at", "from", "is", "of")

set.seed(199)
datatxt <- data.frame(id = c(rnorm(5)), 
                       x = c("Contrary to popular belief, Lorem Ipsum is not simply random text.",
       "A Latin professor at Hampden-Sydney College in Virginia",
       "It has roots in a piece of classical Latin ", 
       "literature from 45 BC, making it over 2000 years old.", 
       "The standard chunk of Lorem Ipsum used since"))

I want to search for the keywords listed in stringstosearch and create a column for each one holding the result.

I tried

library(stringr)
datatxt$result <- str_detect(datatxt$x, paste0(stringstosearch, collapse = '|'))

which returns

> datatxt$result
[1] TRUE TRUE TRUE TRUE TRUE

However, I am looking for an approach which creates a boolean vector for each word in stringstosearch, i.e.

          id                                                                  x    to   and    at  from    is    of
1 -1.9091427 Contrary to popular belief, Lorem Ipsum is not simply random text.  TRUE FALSE FALSE FALSE  TRUE FALSE
2  0.5551667            A Latin professor at Hampden-Sydney College in Virginia FALSE FALSE  TRUE FALSE FALSE FALSE
3 -2.2163365                        It has roots in a piece of classical Latin  FALSE FALSE FALSE FALSE FALSE  TRUE
4  0.4941455              literature from 45 BC, making it over 2000 years old. FALSE FALSE FALSE  TRUE FALSE FALSE
5 -0.5805710                       The standard chunk of Lorem Ipsum used since FALSE FALSE FALSE FALSE FALSE  TRUE

Any idea how to achieve this?

JontroPothon asked Nov 06 '25 06:11


2 Answers

Here is a base R one-liner. Use sprintf() to add the \\b word boundary anchors to each pattern. This means that, for example, "and" will not match "random". Then iterate over these patterns with lapply(), using grepl() to match each pattern to datatxt$x. This returns a list of logical vectors, which we can assign back to the data frame.

datatxt[stringstosearch] <- lapply(
    sprintf("\\b%s\\b", stringstosearch), \(x) grepl(x, datatxt$x)
)

Now datatxt is as desired:

          id                                                                  x    to   and    at  from    is    of
1 -1.9091427 Contrary to popular belief, Lorem Ipsum is not simply random text.  TRUE FALSE FALSE FALSE  TRUE FALSE
2  0.5551667            A Latin professor at Hampden-Sydney College in Virginia FALSE FALSE  TRUE FALSE FALSE FALSE
3 -2.2163365                        It has roots in a piece of classical Latin  FALSE FALSE FALSE FALSE FALSE  TRUE
4  0.4941455              literature from 45 BC, making it over 2000 years old. FALSE FALSE FALSE  TRUE FALSE FALSE
5 -0.5805710                       The standard chunk of Lorem Ipsum used since FALSE FALSE FALSE FALSE FALSE  TRUE
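
To see in isolation why the word boundary anchors matter, here is a minimal sketch on toy data. The names `keywords`, `texts`, `substring_hits` and `hits` are hypothetical and not part of the question's data; wrapping the patterns in `setNames()` is one way, assumed here for illustration, to get a list that is already named by keyword.

```r
# Hypothetical toy inputs, not the question's data
keywords <- c("and", "of")
texts <- c("not simply random text", "salt and pepper", "a piece of cake")

# Plain substring match: "and" is also found inside "random"
substring_hits <- grepl("and", texts)

# Wrap each keyword in \b ... \b so that only whole words match
patterns <- sprintf("\\b%s\\b", keywords)

# One logical vector per keyword, with list names taken from the keywords
hits <- lapply(setNames(patterns, keywords), function(p) grepl(p, texts))

print(substring_hits)
print(hits)
```

With the anchors in place, the first text no longer counts as containing "and", because the match inside "random" is not a whole word.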

tidyverse approach

As you tagged stringr, here is an alternative tidyverse method. It builds the same list of logical vectors as the base R approach, except that the list is named by keyword. The splice operator (!!!) then passes the list elements to dplyr::mutate() as new columns:

datatxt |>
    dplyr::mutate(
        !!!purrr::map(
            purrr::set_names(
                stringr::str_glue("\\b{stringstosearch}\\b"),
                stringstosearch
            ),
            \(str) stringr::str_detect(x, str)
        )
    )
# ^^ same output

I think the base R approach is much cleaner.
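
The splice step can be looked at on its own. The sketch below assumes dplyr is installed and uses a hypothetical toy data frame `df` and a hand-built named list `flags`; it only illustrates how `!!!` turns list elements into named arguments of `mutate()`.

```r
library(dplyr)

# Hypothetical toy data, not the question's data frame
df <- data.frame(txt = c("cats and dogs", "random text"))

# A named list of logical vectors, one element per future column
flags <- list(
    and  = grepl("\\band\\b", df$txt),
    text = grepl("\\btext\\b", df$txt)
)

# `!!!` unpacks the list, so mutate() receives and = ..., text = ...
out <- mutate(df, !!!flags)

print(out)
```

Because the list is named, each element becomes a column with the corresponding name, which is why the answer sets the names to the keywords before mapping.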

SamR answered Nov 09 '25 09:11


I suggest vectorising the pattern argument of stringfish::sf_grepl() with Vectorize():

Vsf_grepl = Vectorize(\(pattern) stringfish::sf_grepl(datatxt$x, pattern))
datatxt[stringstosearch] = Vsf_grepl(sprintf("\\b%s\\b", stringstosearch))

gives

> datatxt
  id                                                                  x    to   and    at  from    is    of
1  1 Contrary to popular belief, Lorem Ipsum is not simply random text.  TRUE FALSE FALSE FALSE  TRUE FALSE
2  2            A Latin professor at Hampden-Sydney College in Virginia FALSE FALSE  TRUE FALSE FALSE FALSE
3  3                        It has roots in a piece of classical Latin  FALSE FALSE FALSE FALSE FALSE  TRUE
4  4              literature from 45 BC, making it over 2000 years old. FALSE FALSE FALSE  TRUE FALSE FALSE
5  5                       The standard chunk of Lorem Ipsum used since FALSE FALSE FALSE FALSE FALSE  TRUE

Note, I changed id generation to id = 1:5.
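
To illustrate what Vectorize() does here without needing stringfish, the following sketch swaps in base grepl() for sf_grepl(). The names `texts`, `keywords`, `vgrepl` and `m` are hypothetical. The wrapped function is called once per pattern and the results are simplified into a logical matrix with one column per pattern.

```r
# Hypothetical toy inputs, not the question's data
texts <- c("salt and pepper", "a piece of cake")
keywords <- c("and", "of")

# grepl() takes a single pattern; Vectorize() wraps it so that it is
# applied to each pattern in turn and the results are bound into a matrix
vgrepl <- Vectorize(function(pattern) grepl(pattern, texts))

m <- vgrepl(sprintf("\\b%s\\b", keywords))

# Label the columns by keyword rather than by the regex pattern
colnames(m) <- keywords

print(m)
```

Assigning such a matrix to `datatxt[stringstosearch]` works because a data frame accepts a matrix on the right-hand side and fills one column per matrix column.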

Friede answered Nov 09 '25 08:11


