I'm trying to combine dplyr and stringr to detect multiple patterns in a dataframe. I want to use dplyr as I want to test a number of different columns.
Here's some sample data:
test.data <- data.frame(item = c("Apple", "Bear", "Orange", "Pear", "Two Apples"))
fruit <- c("Apple", "Orange", "Pear")
test.data
item
1 Apple
2 Bear
3 Orange
4 Pear
5 Two Apples
What I would like to use is something like:
test.data <- test.data %>% mutate(is.fruit = str_detect(item, fruit))
and receive
item is.fruit
1 Apple 1
2 Bear 0
3 Orange 1
4 Pear 1
5 Two Apples 1
A very simple test works
> str_detect("Apple", fruit)
[1] TRUE FALSE FALSE
> str_detect("Bear", fruit)
[1] FALSE FALSE FALSE
But I can't get this to work over the column of the dataframe, even without dplyr:
> test.data$is.fruit <- str_detect(test.data$item, fruit)
Error in check_pattern(pattern, string) :
Lengths of string and pattern not compatible
Does anyone know how to do this?
Detect one of multiple strings in R
You can chain several grepl calls, one per pattern. If you have a lot of patterns to check, it is better to combine grepl with the sapply and apply functions. Another way to detect any of the strings is to join them with the OR operator (|) in a single regex.
You can use the str_detect() function from the stringr package in R to detect the presence or absence of a pattern in a string. It returns TRUE if the pattern is present in the string and FALSE if it is not.
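As a minimal base-R sketch of the two ideas above (one grepl call per pattern combined with sapply and apply, versus a single OR regex), using the sample values from the question. The variable names hits, any_hit and or_hit are only illustrative, and the OR version assumes the patterns contain no regex metacharacters:

```r
# Sample values taken from the question
items <- c("Apple", "Bear", "Orange", "Pear", "Two Apples")
fruit <- c("Apple", "Orange", "Pear")

# One grepl() call per pattern: a logical matrix with
# one row per item and one column per pattern
hits <- sapply(fruit, grepl, x = items, fixed = TRUE)

# An item is a match if any of its row entries is TRUE
any_hit <- apply(hits, 1, any)

# Same check with one regex that joins the patterns with OR.
# Assumption: the patterns contain no regex metacharacters.
or_hit <- grepl(paste(fruit, collapse = "|"), items)

print(any_hit)
print(or_hit)
```

Both vectors should agree for plain-text patterns like these; the matrix version is useful when you also want to know which pattern matched.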
str_detect pairs strings and patterns element by element, so the pattern must be length 1 or the same length as the string vector. Either turn the patterns into one regex using paste(..., collapse = '|') or use any:
sapply(test.data$item, function(x) any(sapply(fruit, str_detect, string = x)))
# Apple Bear Orange Pear Two Apples
# TRUE FALSE TRUE TRUE TRUE
str_detect(test.data$item, paste(fruit, collapse = '|'))
# [1] TRUE FALSE TRUE TRUE TRUE
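To connect this back to the dplyr pipeline in the question, here is a short sketch that puts the collapsed regex inside mutate and converts the result to 1/0 as the question requested. The name fruit_regex is introduced here for illustration, and the sketch assumes the patterns are plain text without regex metacharacters:

```r
library(dplyr)
library(stringr)

test.data <- data.frame(item = c("Apple", "Bear", "Orange", "Pear", "Two Apples"))
fruit <- c("Apple", "Orange", "Pear")

# Collapse the patterns into a single length-1 regex: "Apple|Orange|Pear"
fruit_regex <- paste(fruit, collapse = "|")

# str_detect() now gets one pattern, so it is vectorised over the column;
# as.integer() turns TRUE/FALSE into the 1/0 shown in the question
result <- test.data %>%
  mutate(is.fruit = as.integer(str_detect(item, fruit_regex)))

print(result)
```

Because the pattern is a single string, no rowwise step is needed and the same expression can be reused for other columns.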
This simple approach works fine for EXACT matches:
test.data %>% mutate(is.fruit = item %in% fruit)
# A tibble: 5 x 2
item is.fruit
<chr> <lgl>
1 Apple TRUE
2 Bear FALSE
3 Orange TRUE
4 Pear TRUE
5 Two Apples FALSE
This approach works for partial matching (which is the question asked):
test.data %>%
rowwise() %>%
mutate(is.fruit = sum(str_detect(item, fruit)))
Source: local data frame [5 x 2]
Groups: <by row>
# A tibble: 5 x 2
item is.fruit
<chr> <int>
1 Apple 1
2 Bear 0
3 Orange 1
4 Pear 1
5 Two Apples 1
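The question mentions testing a number of different columns. One possible way to extend the partial-matching idea, sketched here under the assumption that dplyr 1.0 or later is available (for across), is to apply the same check to several columns at once. The column other and its values are hypothetical and only there to show a second column:

```r
library(dplyr)
library(stringr)

test.data <- data.frame(
  item  = c("Apple", "Bear", "Orange", "Pear", "Two Apples"),
  # Hypothetical second column, not part of the original data
  other = c("Chair", "Pear tree", "Desk", "Lamp", "Orange juice")
)
fruit <- c("Apple", "Orange", "Pear")

# Single regex that matches any of the patterns (plain-text patterns assumed)
fruit_regex <- paste(fruit, collapse = "|")

# Apply the same detection to each listed column and
# store each result in a new "<column>.is.fruit" column
result <- test.data %>%
  mutate(across(c(item, other),
                ~ as.integer(str_detect(.x, fruit_regex)),
                .names = "{.col}.is.fruit"))

print(result)
```

This avoids rowwise, which can be slow on large data frames, at the cost of reporting only whether any pattern matched rather than how many did.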