How to prevent regmatches from dropping non-matches?

Question

I would like to capture the first match, and return NA if there is no match.

regexpr("a+", c("abc", "def", "cba a", "aa"), perl=TRUE)
# [1]  1 -1  3  1
# attr(,"match.length")
# [1]  1 -1  1  2

x <- c("abc", "def", "cba a", "aa")
m <- regexpr("a+", x, perl=TRUE)
regmatches(x, m)
# [1]  "a"  "a"  "aa"

But regmatches drops the non-match, so the result is shorter than the input. I expected "a", NA, "a", "aa".

thelatemail · Accepted Answer

Staying with regexpr:

r <- regexpr("a+", x)
out <- rep(NA_character_, length(x))   # start with NA everywhere
out[r != -1] <- regmatches(x, r)       # fill in only the positions that matched
out
#[1] "a"  NA   "a"  "aa"
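
If you need this in more than one place, the same idea can be wrapped in a small function. This is only a sketch: first_match is a made-up name, not something from base R, and the extra is.na() check is my own addition so that NA inputs are treated as non-matches.

```r
# Hypothetical helper (not part of base R): first regex match per element,
# with NA where the pattern does not occur.
first_match <- function(pattern, x, ...) {
  m <- regexpr(pattern, x, ...)
  out <- rep(NA_character_, length(x))  # character NA, so no type coercion later
  hit <- !is.na(m) & m != -1L           # NA inputs give NA positions; count them as non-matches
  out[hit] <- regmatches(x, m)          # regmatches() returns the matched elements only
  out
}

x <- c("abc", "def", "cba a", "aa")
first_match("a+", x)
# [1] "a"  NA   "a"  "aa"
```

Extra arguments such as perl=TRUE are passed through to regexpr via the dots.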

Ricardo Saporta · Answer

Use regexec instead: with it, regmatches returns a list, which lets you catch the character(0) elements before unlisting.

R <- regmatches(x, regexec("a+", x))
# replace each empty (no-match) element with NA, then flatten the list
unlist({R[sapply(R, length) == 0] <- NA; R})

 # [1] "a"  NA   "a"  "aa"
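
To see why this works, it may help to look at the element lengths of the intermediate list. The sketch below is a variant, not the answer's own code: it swaps the in-place replacement for vapply, and the names R and res are just for illustration.

```r
x <- c("abc", "def", "cba a", "aa")

# With regexec match data, regmatches() returns one list element per input
# string; a string without a match gives character(0).
R <- regmatches(x, regexec("a+", x))
lengths(R)
# [1] 1 0 1 1

# Map empty elements to NA_character_ and take the full match otherwise.
res <- vapply(R, function(el) if (length(el) == 0L) NA_character_ else el[1L],
              character(1))
res
# [1] "a"  NA   "a"  "aa"
```

vapply checks that every result is a single string, so a surprise in the match list fails loudly instead of silently changing the shape of the output.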

lmo · Answer

Since R 3.3.0, you can pull out both the matched and the non-matched pieces with the invert=NA argument. The help file says:

if invert is NA, regmatches extracts both non-matched and matched substrings, always starting and ending with a non-match (empty if the match occurred at the beginning or the end, respectively).

The output is a list. In the typical case of interest (matching a single pattern with regexpr), each element has length 3 or 1: length 3 when a match was found, and length 1 when it was not.

myMatch <- regmatches(x, m, invert=NA)
myMatch
[[1]]
[1] ""   "a"  "bc"

[[2]]
[1] "def"

[[3]]
[1] "cb" "a"  " a"

[[4]]
[1] ""   "aa" ""

So to extract what you want (with "" in place of NA), you can use sapply as follows:

myVec <- sapply(myMatch, function(x) {if(length(x) == 1) "" else x[2]})
myVec
[1] "a"  ""   "a"  "aa"

At this point, if you really want NA instead of "", you can use

is.na(myVec) <- nchar(myVec) == 0L
myVec
[1] "a"  NA   "a"  "aa"

Some revisions:
Note that you can collapse the last two steps into a single line:

myVec <- sapply(myMatch, function(x) {if(length(x) == 1) NA_character_ else x[2]})

NA is logical by default, so using it forces an extra type conversion. The character version, NA_character_, avoids this.

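An even slicker extraction method for the final line is to use [. Indexing past the end of a vector returns NA, so the length-1 (no match) elements yield NA automatically: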

sapply(myMatch, `[`, 2)
[1] "a"  NA   "a"  "aa"

So you can do the whole thing in a fairly readable single line:

sapply(regmatches(x, m, invert=NA), `[`, 2)
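
For completeness, here is a self-contained sketch of that one-liner with x and m defined, plus a quick look at why [ returns NA for the non-match. Using vapply in place of sapply is my own substitution (it pins the result to a character vector), and parts and res are illustrative names. It assumes R >= 3.3.0 for invert=NA.

```r
x <- c("abc", "def", "cba a", "aa")
m <- regexpr("a+", x, perl = TRUE)

parts <- regmatches(x, m, invert = NA)
lengths(parts)   # 3 pieces where there was a match, 1 where there was none
# [1] 3 1 3 3

"def"[2]         # indexing past the end of a length-1 vector gives NA
# [1] NA

# Take the second piece (the match itself) of every element.
res <- vapply(parts, `[`, character(1), 2)
res
# [1] "a"  NA   "a"  "aa"
```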

TheComeOnMan · Answer

Using more or less the same construction as yours:

chars <- c("abc", "def", "cba a", "aa")    

chars[
   regexpr("a+", chars, perl=TRUE) > 0
][1] #abc

chars[
   regexpr("q", chars, perl=TRUE) > 0
][1]  #NA

#vector[
#    find all indices where regexpr returned positive value i.e., match was found
#][return the first element of the above subset]

Edit - Seems like I misunderstood the question. But since two people have found this useful I shall let it stay.
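
As a side note, the same "first element that matches" lookup can probably be written more compactly with grep(value=TRUE). This is a sketch of an alternative, not part of the answer above, and hit and miss are illustrative names.

```r
chars <- c("abc", "def", "cba a", "aa")

# grep(value = TRUE) returns the matching elements themselves;
# [1] then takes the first one, or NA if there are none.
hit  <- grep("a+", chars, value = TRUE)[1]
miss <- grep("q",  chars, value = TRUE)[1]

hit
# [1] "abc"
miss
# [1] NA
```

Like the code above, this returns a whole element of the vector, not the matched substring.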

Tags:

regex

r

colinfang
