
R: How to search for a regex across the elements of a vector?

Tags: regex, r

Is it possible in R to search for a regex in a vector as if all the elements were collapsed into a single string? If we actually collapse the elements to do this, it becomes impossible to restore the original element boundaries after the search.

Here is a vector:

vector<-c("I", "met", "a", "cow")

Now, the search word is "meta" (elements 2 and 3 collapsed).

Let's say my task is to merge the two elements across which the search string lies.

So what I expect is this:

vector <- c("I", "meta", "cow")

Is it possible to do this? Please help.

jackson asked Sep 03 '12 16:09



3 Answers

If you'd like something that matches "meta" but not "taco", this will do the trick:

myFun <- function(vector, word) {
    D <- "UnLiKeLyStRiNg" 

    ## Construct a string on which you'll perform regex-search
    xx <- paste0(paste0(D, vector, collapse=""), D)

    ## Construct the regex pattern
    start <- paste0("(?<=", D, ")")
    mid <- paste0(strsplit(word, "")[[1]], collapse=paste0("(", D, ")?"))
    end <- paste0("(?=", D, ")")
    pat <- paste0(start, mid, end)

    ## Use it
    strsplit(gsub(pat, word, xx, perl=TRUE), D)[[1]][-1]
}

vector <- c("I", "met", "a", "cow")

myFun(vector, "meta")
# [1] "I"    "meta" "cow" 
myFun(vector, "taco")
# [1] "I"   "met" "a"   "cow"
myFun(vector, "Imet")
# [1] "Imet" "a"    "cow" 
myFun(vector, "Ime")
# [1] "I"   "met" "a"   "cow"
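To see why "meta" matches while "taco" does not, it can help to print the intermediate strings that myFun builds. The following is only a sketch that repeats the same steps outside the function; the short delimiter "#" is an assumption made here for readability and is not what the answer uses.

```r
# Sketch: inspect the delimited string and the pattern built for "meta".
# Assumption: "#" stands in for the unlikely delimiter, for readability only.
D <- "#"
vector <- c("I", "met", "a", "cow")
word <- "meta"

# Every element is preceded by the delimiter, plus one trailing delimiter
xx <- paste0(paste0(D, vector, collapse = ""), D)
xx
# [1] "#I#met#a#cow#"

# Optional delimiter between letters; a delimiter is required at both ends
pat <- paste0("(?<=", D, ")",
              paste0(strsplit(word, "")[[1]], collapse = paste0("(", D, ")?")),
              "(?=", D, ")")
pat
# [1] "(?<=#)m(#)?e(#)?t(#)?a(?=#)"

# "met#a" is bounded by delimiters on both sides, so the pattern matches
grepl(pat, xx, perl = TRUE)
# [1] TRUE
```

For "taco" the same construction fails because the "t" is preceded by "e" rather than by the delimiter, so the look-behind rejects it. Note that the letters of word are placed into the pattern unescaped, so a search word containing regex metacharacters would need escaping first.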
Josh O'Brien answered Sep 23 '22 06:09



If only complete elements should be merged, you could try this approach:

mergeRegExpr <- function(x, pattern) {
    str <- paste(x, sep="", collapse="")

    ## find starting position of each word
    wordStart <- head(cumsum(c(1, nchar(x))), -1)

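    ## look for pattern
    rx <- regexpr(pattern=pattern, text=str, fixed=TRUE)

    ## no match: return the input unchanged
    if (rx == -1) return(x)

    ## end position of the match == rx+nchar(pattern)-1
    rxEnd <- rx+attr(rx, "match.length")-1

    ## which vector elements start outside the match
    sel <- wordStart < rx | wordStart > rxEnd

    ## insert the merged elements after the elements that start
    ## before the match (an element count, not a character position)
    return(append(x[sel], paste(x[!sel], collapse=""), sum(wordStart < rx)))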
}

vector <- c("I", "met", "a", "cow")

mergeRegExpr(vector, "meta")
# "I"    "meta" "cow"
mergeRegExpr(vector, "acow")
# "I"    "met"  "acow"
mergeRegExpr(vector, "Imeta")
# "Imeta" "cow"

## partial matching doesn't work: every element that starts
## inside the match is merged, so "taco" is not isolated
mergeRegExpr(vector, "taco")
# "I"    "met"  "acow"
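The position bookkeeping in mergeRegExpr may be easier to follow with the intermediate values written out. This is a sketch for the pattern "meta" only; the names start, end and inside are illustrative and do not appear in the answer.

```r
# Sketch: the intermediate values behind mergeRegExpr for the pattern "meta".
x <- c("I", "met", "a", "cow")
str <- paste(x, collapse = "")          # "Imetacow"

# Start position of each element inside the collapsed string
wordStart <- head(cumsum(c(1, nchar(x))), -1)
wordStart
# [1] 1 2 5 6

# Start and end position of the match in the collapsed string
rx <- regexpr("meta", str, fixed = TRUE)
start <- as.integer(rx)                           # 2
end <- start + attr(rx, "match.length") - 1L      # 5

# Elements whose start position falls inside [start, end] get merged
inside <- wordStart >= start & wordStart <= end
x[inside]
# [1] "met" "a"
```

Because the selection is based only on where each element starts, a match that begins in the middle of an element (as with "taco") still pulls in whole elements rather than splitting them.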
sgibb answered Sep 19 '22 06:09



Building on Carl Witthoft's comment, my solution was not with regex, but with basic matching:

# A slightly longer vector
v = c("I", "met", "a", "cow", "today",
      "You", "met", "a", "cow", "today")

# Create the combinations of each pair
temp1 = sapply(1:(length(v)-1), 
               function(x) paste0(v[x], v[x+1]))

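# Grab the index of the desired search term
temp2 = which(temp1 %in% "meta")
# grep() also works here, but it matches substrings,
# so it can also select pairs that merely contain "meta".
# temp2 = grep("meta", temp1)

# Do some manual substitution and deletion
# (skipped when nothing matched, since v[-integer(0)] would be empty)
if (length(temp2) > 0) {
    v[temp2] <- "meta"
    v <- v[-(temp2+1)]
}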

I don't think this is an ideal solution at all, though: it only finds matches that span exactly two adjacent elements.
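The pairwise steps above could be wrapped into a small function so the search word is not hard-coded. This is a minimal sketch, not a general solution: mergeAdjacent is a hypothetical name, and it assumes the match spans exactly two neighbouring elements and that matching pairs do not overlap.

```r
# Sketch: merge every pair of adjacent elements that together spell `word`.
# Assumptions: `mergeAdjacent` is a hypothetical name, the match must span
# exactly two neighbouring elements, and matching pairs do not overlap.
mergeAdjacent <- function(v, word) {
    if (length(v) < 2) return(v)

    # Paste each element to its right-hand neighbour
    pairs <- paste0(head(v, -1), tail(v, -1))

    # Indices of the left-hand element of each matching pair
    hit <- which(pairs == word)
    if (length(hit) == 0) return(v)

    # Overwrite the left element, then drop the right one
    v[hit] <- word
    v[-(hit + 1)]
}

v <- c("I", "met", "a", "cow", "today",
       "You", "met", "a", "cow", "today")

mergeAdjacent(v, "meta")
# [1] "I"     "meta"  "cow"   "today" "You"   "meta"  "cow"   "today"

# No two adjacent elements spell "taco", so v comes back unchanged
mergeAdjacent(v, "taco")
```

The early return for an empty hit matters because indexing with -(integer(0) + 1) would otherwise return an empty vector instead of the input.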

A5C1D2H2I1M1N2O1R2T1 answered Sep 20 '22 06:09
