Assigning groups using grepl with multiple inputs

Tags: regex, r

I have a dataframe:

df <- data.frame(name=c("john", "david", "callum", "joanna", "allison", "slocum", "lisa"), id=1:7)
df

     name id
1    john  1
2   david  2
3  callum  3
4  joanna  4
5 allison  5
6  slocum  6
7    lisa  7

I have a vector of regular expressions that I want to match against the df$name variable:

vec <- c("lis", "^jo", "um$")

The output I want to get is as follows:

     name id group
1    john  1     2
2   david  2    NA
3  callum  3     3
4  joanna  4     2
5 allison  5     1
6  slocum  6     3
7    lisa  7     1

I could do this with nested ifelse calls:

df$group <- ifelse(grepl("lis", df$name), 1,
              ifelse(grepl("^jo", df$name), 2,
               ifelse(grepl("um$", df$name), 3,
                 NA)))

However, I want to do this directly from 'vec', because the values in vec are generated reactively in a Shiny app. Can I assign groups based on the index of each pattern in vec?

Further, when a name matches more than one pattern, it should get the group of the first matching pattern in vec. For example, with the vec below, 'callum' matches both "all" and "um$" but should get group 1.

vec <- c("all", "^jo", "um$")

jalapic asked Jan 10 '16 06:01


2 Answers

Here are several options:

df$group <- apply(Vectorize(grepl, "pattern")(vec, df$name),
                  1,
                  function(ii) which(ii)[1])
#     name id group
#1    john  1     2
#2   david  2    NA
#3  callum  3     3
#4  joanna  4     2
#5 allison  5     1
#6  slocum  6     3
#7    lisa  7     1
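
As a possible variation on the option above, the same logical matrix can be reduced without apply. This is only a sketch in base R, not part of the original answer: it builds the matrix with lapply and cbind, then uses max.col with ties.method = "first" to pick the first matching column. It assumes vec holds at least one pattern.

```r
df <- data.frame(name = c("john", "david", "callum", "joanna",
                          "allison", "slocum", "lisa"),
                 id = 1:7)
vec <- c("lis", "^jo", "um$")

# One logical column per pattern: hits[i, j] is TRUE when name i matches vec[j].
# Assumption: vec is non-empty (cbind of an empty list would give NULL).
hits <- do.call(cbind, lapply(vec, grepl, x = df$name))

# Convert TRUE/FALSE to 1/0; the first column holding the row maximum is then
# the first matching pattern, as long as the row has at least one TRUE.
group <- max.col(hits * 1, ties.method = "first")

# Rows with no match at all would otherwise be reported as column 1.
group[rowSums(hits) == 0] <- NA

df$group <- group
df$group
# 2 NA 3 2 1 3 1
```

Because the column order follows vec, a name that matches several patterns gets the index of the earliest one.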

Use a named vector and merge on it (this relies on df$id being equal to the row number, because grep returns row positions):

names(vec) <- seq_along(vec)

df <- merge(df, stack(Vectorize(grep, "pattern", SIMPLIFY=FALSE)(vec, df$name)),
 by.x="id", by.y="values", all.x = TRUE)

df[!duplicated(df$id),] # to keep only the first match
#  id    name  ind
#1  1    john    2
#2  2   david <NA>
#3  3  callum    3
#4  4  joanna    2
#5  5 allison    1
#6  6  slocum    3
#7  7    lisa    1

A for loop:

df$group <- NA

for ( i in rev(seq_along(vec))) {
  TFvec <- grepl(vec[i], df$name)
  df$group[TFvec] <- i
}

df
#     name id group
#1    john  1     2
#2   david  2    NA
#3  callum  3     3
#4  joanna  4     2
#5 allison  5     1
#6  slocum  6     3
#7    lisa  7     1
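
For use with a vec that changes reactively, the loop could be wrapped in a small helper. The sketch below is an illustration rather than the answer's own code: first_match_group is a hypothetical name, and it loops forward while only filling names that are still unassigned, which gives the same "first pattern wins" result as the reversed loop above.

```r
# Hypothetical helper: returns, for each element of x, the index of the
# first pattern in `patterns` that matches it, or NA if none match.
first_match_group <- function(x, patterns) {
  group <- rep(NA_integer_, length(x))
  for (i in seq_along(patterns)) {
    # Only touch elements that no earlier pattern has claimed.
    unassigned <- is.na(group)
    group[unassigned & grepl(patterns[i], x)] <- i
  }
  group
}

df <- data.frame(name = c("john", "david", "callum", "joanna",
                          "allison", "slocum", "lisa"),
                 id = 1:7)

# The overlapping case from the question: "callum" matches "all" and "um$".
vec <- c("all", "^jo", "um$")
df$group <- first_match_group(df$name, vec)
df$group
# 2 NA 1 2 1 3 NA
```

Here 'callum' gets group 1 because "all" comes before "um$" in vec, and 'lisa' gets NA because none of the three patterns match it.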

Or you can use outer with stri_match_first_regex from stringi:

library(stringi)
match.mat <- outer(df$name, vec, stri_match_first_regex)
df$group <- apply(match.mat, 1, function(ii) which(!is.na(ii))[1]) 
# [1] for first match in `vec`

#     name id group
#1    john  1     2
#2   david  2    NA
#3  callum  3     3
#4  joanna  4     2
#5 allison  5     1
#6  slocum  6     3
#7    lisa  7     1

Jota answered Sep 28 '22 05:09


A vectorised solution, using rebus and stringi.

library(rebus)
library(stringi)

Create a regular expression that captures any of the values in vec.

vec <- c("lis", "^jo", "um$")
(rx <- or1(vec, capture = TRUE))
## <regex> (lis|^jo|um$)

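Match the regex, then convert to factor and integer. The factor levels are the matched substrings rather than the patterns themselves, so the anchors are dropped ("jo" rather than "^jo").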

matches <- stri_match_first_regex(df$name, rx)[, 2]
df$group <- as.integer(factor(matches, levels = c("lis", "jo", "um")))

df now looks like this:

     name id group
1    john  1     2
2   david  2    NA
3  callum  3     3
4  joanna  4     2
5 allison  5     1
6  slocum  6     3
7    lisa  7     1

Richie Cotton answered Sep 28 '22 05:09