
How do you create a regex to subset a data frame based on search strings?

Tags: regex, r

I am trying to search for strings to subset the data frame. My df looks like this:

dput(df)
structure(list(Cause = structure(c(2L, 1L), .Label = c("jasper not able to read the property table after the release", 
"More than 7000  messages loaded which stuck up"), class = "factor"), 
    Resolution = structure(1:2, .Label = c("jobs and reports are processed", 
    "Updated the property table which resolved the issue."), class = "factor")), .Names = c("Cause", 
"Resolution"), class = "data.frame", row.names = c(NA, -2L))

I am trying to do this:

df1<-subset(df, grepl("*MQ*|*queue*|*Queue*", df$Cause))

This should search for MQ, queue or Queue in the Cause column and subset the data frame df to the matching records. It does not seem to work: it also returns records in which none of the strings MQ, queue or Queue is present.

Is this the right way to do it, or are there other approaches I can follow?

user1471980 asked Mar 13 '23 10:03



2 Answers

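The regexp below seems to work. I have added a row to your data.frame to make the example more interesting.

I think the problem came from the *s in your regexp. In a regex, * is a quantifier meaning "zero or more of the preceding item", not a shell-style wildcard, so MQ* matches any string that contains an M. I also added parentheses to group the alternatives around |, but I don't think they are mandatory here.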
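
As a minimal sketch of that quantifier behaviour, assuming only base R (the vector name `cause` is illustrative and not part of the original code), compare a pattern with and without the trailing star:

```r
# Illustrative strings; the first one is taken from the question's data.
cause <- c("More than 7000  messages loaded which stuck up",
           "blabla Queue blabla")

# "MQ*" reads as "M, then zero or more Q", so a bare "M" is enough to match.
with_star <- grepl("MQ*", cause)
with_star     # TRUE FALSE ("More" contains an M)

# Without the star, the literal text "MQ" has to be present.
without_star <- grepl("MQ", cause)
without_star  # FALSE FALSE
```

This only illustrates the trailing star; how a leading star is treated can depend on the regex engine, so it is best avoided altogether.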

df <- data.frame(Cause=c("jasper not able to read the property table after the release", 
                         "More than 7000  messages loaded which stuck up",
                         "blabla Queue blabla"),
                 Resolution = c("jobs and reports are processed", 
                                "Updated the property table which resolved the issue.",
                                "hop"))

> head(df)
                                                         Cause                                           Resolution
1 jasper not able to read the property table after the release                       jobs and reports are processed
2               More than 7000  messages loaded which stuck up Updated the property table which resolved the issue.
3                                          blabla Queue blabla                                                  hop

> subset(df, grepl("(MQ)|(queue)|(Queue)", df$Cause))
                Cause Resolution
3 blabla Queue blabla        hop

Is this what you wanted?

Vincent Bonhomme answered Apr 30 '23 18:04



Transferred from comments:

subset(df, grepl("MQ|Queue|queue", Cause))

or, if matching in any letter case is acceptable:

subset(df, grepl("mq|queue", Cause, ignore.case = TRUE))

To get more information try ?regex and ?grepl from within R.
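
If the search strings are held in a vector, one possible sketch is to build the alternation with paste() and reuse the same grepl() call. The names `terms`, `pattern` and `matched` are illustrative assumptions, not part of the answers above, and the data frame is the three-row example from the first answer:

```r
# Example data, copied from the first answer so the block runs on its own.
df <- data.frame(
  Cause = c("jasper not able to read the property table after the release",
            "More than 7000  messages loaded which stuck up",
            "blabla Queue blabla"),
  Resolution = c("jobs and reports are processed",
                 "Updated the property table which resolved the issue.",
                 "hop")
)

# Hypothetical search strings; assumed to be plain words with no regex metacharacters.
terms <- c("mq", "queue")

# Join the terms with "|" so the pattern reads "mq|queue".
pattern <- paste(terms, collapse = "|")

# Keep the rows whose Cause matches any term, ignoring letter case.
matched <- df[grepl(pattern, df$Cause, ignore.case = TRUE), ]
matched
```

Terms that contain characters such as . or * would need escaping first, since they are interpreted as regex syntax; this sketch assumes they do not.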

G. Grothendieck answered Apr 30 '23 20:04
