I'm trying to combine dplyr and stringr to detect multiple patterns in a dataframe. I want to use dplyr as I want to test a number of different columns.
Here's some sample data:
test.data <- data.frame(item = c("Apple", "Bear", "Orange", "Pear", "Two Apples"))
fruit <- c("Apple", "Orange", "Pear")
test.data
        item
1      Apple
2       Bear
3     Orange
4       Pear
5 Two Apples
What I would like to use is something like:
test.data <- test.data %>% mutate(is.fruit = str_detect(item, fruit))
and receive
        item is.fruit
1      Apple        1
2       Bear        0
3     Orange        1
4       Pear        1
5 Two Apples        1
A very simple test works
> str_detect("Apple", fruit)
[1]  TRUE FALSE FALSE
> str_detect("Bear", fruit)
[1] FALSE FALSE FALSE
But I can't get this to work over the column of the dataframe, even without dplyr:
> test.data$is.fruit <- str_detect(test.data$item, fruit)
Error in check_pattern(pattern, string) : 
  Lengths of string and pattern not compatible
Does anyone know how to do this?
Detect one of multiple strings in R
You can chain several grepl() calls, one per pattern. If you have many patterns to check, it is better to loop over them with sapply() and grepl(), then combine the results per element with apply(). Another way to detect any of the strings is to join them with the OR operator (|) in a single regular expression.
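As a minimal base R sketch of the grepl approaches described above, using the sample data from the question (the variable names hits, any_hit and any_hit_regex are made up for illustration):

```r
items <- c("Apple", "Bear", "Orange", "Pear", "Two Apples")
fruit <- c("Apple", "Orange", "Pear")

# One grepl() call per pattern: rows are items, columns are patterns.
# fixed = TRUE treats each pattern as a literal string, not a regex.
hits <- sapply(fruit, grepl, x = items, fixed = TRUE)

# An item counts as a fruit if any pattern matched it
any_hit <- apply(hits, 1, any)

# Same idea with a single regex built with the OR operator
any_hit_regex <- grepl(paste(fruit, collapse = "|"), items)
```

Both any_hit and any_hit_regex should be TRUE for every item except "Bear". The regex version is shorter, but it assumes the patterns contain no regex metacharacters.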
You can use the str_detect() function from the stringr package in R to detect the presence or absence of a pattern in a string. It returns TRUE if the pattern is present in the string and FALSE if it is not.
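str_detect() is vectorised over both string and pattern, so the pattern must have length 1 or the same length as the string. A length-3 pattern cannot be recycled against a length-5 column, which is why you get the error. Either turn the patterns into one regex using paste(..., collapse = '|') or use any: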
sapply(test.data$item, function(x) any(sapply(fruit, str_detect, string = x)))
# Apple       Bear     Orange       Pear Two Apples
#  TRUE      FALSE       TRUE       TRUE       TRUE
str_detect(test.data$item, paste(fruit, collapse = '|'))
# [1]  TRUE FALSE  TRUE  TRUE  TRUE
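Since the question asks about testing several columns with dplyr, here is a hedged sketch of how the collapsed regex could be applied with mutate() and across(). It assumes dplyr 1.0 or later; the column other and the name fruit.regex are hypothetical and exist only for illustration.

```r
library(dplyr)
library(stringr)

fruit <- c("Apple", "Orange", "Pear")

# 'other' is a hypothetical second column, added only for illustration
test.data <- data.frame(
  item  = c("Apple", "Bear", "Orange", "Pear", "Two Apples"),
  other = c("Chair", "Pear tree", "Desk", "Lamp", "Orange juice"),
  stringsAsFactors = FALSE
)

# Collapse the patterns into one regex: "Apple|Orange|Pear"
fruit.regex <- paste(fruit, collapse = "|")

# Add one 0/1 indicator column per tested column
result <- test.data %>%
  mutate(across(c(item, other),
                ~ as.integer(str_detect(.x, fruit.regex)),
                .names = "{.col}.is.fruit"))
```

This should add item.is.fruit and other.is.fruit as integer columns. Drop as.integer() if you prefer logical TRUE/FALSE values.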
This simple approach works fine for EXACT matches:
test.data %>% mutate(is.fruit = item %in% fruit)
# A tibble: 5 x 2
        item is.fruit
       <chr>    <lgl>
1      Apple     TRUE
2       Bear    FALSE
3     Orange     TRUE
4       Pear     TRUE
5 Two Apples    FALSE
This approach works for partial matching (which is the question asked):
test.data %>%
  rowwise() %>%
  mutate(is.fruit = sum(str_detect(item, fruit)))
Source: local data frame [5 x 2]
Groups: <by row>
# A tibble: 5 x 2
        item is.fruit
       <chr>    <int>
1      Apple        1
2       Bear        0
3     Orange        1
4       Pear        1
5 Two Apples        1
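One thing to watch with the rowwise approach: sum() counts how many patterns matched, so an item that matches two patterns gets 2 rather than 1. A small sketch of the difference, using a made-up item ("Apple and Pear") and hypothetical column names n.matches and is.fruit:

```r
library(dplyr)
library(stringr)

fruit <- c("Apple", "Orange", "Pear")

# "Apple and Pear" is a made-up item that matches two patterns
demo <- data.frame(item = c("Apple", "Bear", "Apple and Pear"),
                   stringsAsFactors = FALSE)

result <- demo %>%
  rowwise() %>%
  mutate(n.matches = sum(str_detect(item, fruit)),               # number of patterns matched
         is.fruit  = as.integer(any(str_detect(item, fruit)))) %>%  # strict 0/1 flag
  ungroup()
```

If you need a strict 0/1 indicator, any() is the safer choice. The ungroup() at the end removes the rowwise grouping so later steps operate on whole columns again.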