Filter according to partial match of string variable in R

Tags: r, dplyr, stringr

I have a data frame with a string column "disease". I want to keep the rows where it partially matches "trauma" or "Trauma". I am currently doing the following with dplyr and stringr:

trauma_set <- df %>% filter(str_detect(disease, "trauma|Trauma"))

But the result also includes "Nontraumatic" and "nontraumatic". How can I keep only "trauma", "Trauma", "traumatic" or "Traumatic" without also matching "nontrauma" or "Nontrauma"? Also, is there a way to define the pattern without having to spell out both the uppercase and lowercase versions (as in both trauma and Trauma)?

Asked by dc.tv, Sep 06 '25 23:09


1 Answer

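To match only at the start of a word, put the word boundary \\b before the pattern. "Nontraumatic" has no boundary between "Non" and "traumatic", so it is excluded, while "trauma", "Trauma", "traumatic" and "Traumatic" still match. To ignore case, wrap the pattern in regex() with ignore_case = TRUE instead of listing both spellings: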

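library(dplyr)
library(stringr)

# keep rows where "trauma" starts a word, in any letter case
out <- df %>%
  filter(str_detect(disease, regex("\\btrauma", ignore_case = TRUE)))

# check: no remaining value starts with "Non" or "non"
sum(str_detect(out$disease, regex("^Non", ignore_case = TRUE)))
#[1] 0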
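As a quick check of how the boundary behaves, here is a minimal sketch on a small vector. The values in x are hypothetical and chosen only for illustration. It also shows two spellings that should be equivalent: the inline (?i) flag inside the pattern, and base R's grepl() for cases where stringr is not available.

```r
library(stringr)

# Hypothetical toy values, used only to illustrate the matching rule
x <- c("Nontraumatic", "Trauma", "head trauma", "nontraumatic", "Traumatic")

# \\b requires a word boundary immediately before "trauma"
hits <- str_detect(x, regex("\\btrauma", ignore_case = TRUE))

# Inline (?i) flag: ignores case from within the pattern itself
hits_inline <- str_detect(x, "(?i)\\btrauma")

# Base R equivalent, without stringr
hits_base <- grepl("\\btrauma", x, ignore.case = TRUE, perl = TRUE)

x[hits]
#[1] "Trauma"      "head trauma" "Traumatic"
```

Note that "head trauma" matches because the space before "trauma" is a word boundary; only values where "trauma" is glued onto a preceding letter, such as "nontraumatic", are dropped.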

data

set.seed(24)
df <- data.frame(
  disease = sample(c("Nontraumatic", "Trauma", "Traumatic",
                     "nontraumatic", "traumatic", "trauma"),
                   50, replace = TRUE),
  value = rnorm(50))
Answered by akrun, Sep 11 '25 03:09