
(Skip)(Fail) Parse error with stringi

Tags:

regex

r

stringi

I was reading The Greatest Regex Trick Ever, which shows how to match something unless it appears in a certain context, using (*SKIP)(*FAIL). I took it for a spin on the toy example below: it works in base R but throws the following error in stringi. Do I need to do something different with stringi to get the syntax to work?

x <- c("I shouldn't", "you should", "I know", "'bout time")
pat <- '(?:houl)(*SKIP)(*FAIL)|(ou)'

grepl(pat, x, perl = TRUE)
## [1] FALSE  TRUE FALSE  TRUE

stringi::stri_detect_regex(x, pat)
## Error in stringi::stri_detect_regex(x, pat) : 
##   Syntax error in regexp pattern. (U_REGEX_RULE_SYNTAX)
Tyler Rinker asked Oct 31 '22 11:10


1 Answer

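The stringi package (and stringr, which is built on it) uses the ICU regex library, which does not support backtracking control verbs such as (*SKIP) and (*FAIL). Those verbs are a PCRE feature (borrowed from Perl), which is why your pattern works in base R with perl = TRUE.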
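As a rough illustration of the engine difference, the sketch below tries the ICU engine first and falls back to PCRE when the pattern is rejected. The helper name detect_any_engine is hypothetical (it is not part of stringi or base R), and a silent fallback like this would also hide genuine typos in a pattern, so treat it as a demonstration rather than a recommendation.

```r
x <- c("I shouldn't", "you should", "I know", "'bout time")
pat <- '(?:houl)(*SKIP)(*FAIL)|(ou)'

# Hypothetical helper: use stringi (ICU) when it accepts the pattern,
# otherwise fall back to base R with the PCRE engine (perl = TRUE).
# The error handler also fires if stringi is not installed.
detect_any_engine <- function(x, pattern) {
  tryCatch(
    stringi::stri_detect_regex(x, pattern),
    error = function(e) grepl(pattern, x, perl = TRUE)
  )
}

res <- detect_any_engine(x, pat)
print(res)
```

Because ICU rejects the verbs, the call above should end up in the PCRE branch and give the same result as the grepl() call in the question.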

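Your pattern matches ou unless it is inside houl. For your sample data it is enough to match ou that is neither preceded with h nor followed with l, which ordinary lookarounds can express. This is slightly stricter than the original, since it also rejects the ou in words like hour or soul; if that matters, (?<!h)ou|ou(?!l) rejects only the ou inside houl. The simple version is: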

(?<!h)ou(?!l)

See the regex demo

> x <- c("I shouldn't", "you should", "I know", "'bout time")
> pat1 <- "(?<!h)ou(?!l)"
> stringi::stri_detect_regex(x, pat1)
[1] FALSE  TRUE FALSE  TRUE
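
To see how a lookaround rewrite relates to the original verb pattern, the sketch below runs both through base R's PCRE engine (perl = TRUE), which accepts either syntax. The extra inputs (hour, soul, out) and the variable names are made up for illustration. The lookaround patterns use only constructs that ICU should also accept, so they can be passed to stringi::stri_detect_regex as well.

```r
# Illustrative inputs: one houl word plus words where ou is only
# preceded with h (hour) or only followed with l (soul).
x2 <- c("should", "hour", "soul", "out")

verb   <- "(?:houl)(*SKIP)(*FAIL)|ou"  # PCRE only: skip houl, then match ou
strict <- "(?<!h)ou(?!l)"              # ou not after h AND not before l
either <- "(?<!h)ou|ou(?!l)"           # ou not after h OR not before l

res_verb   <- grepl(verb,   x2, perl = TRUE)
res_strict <- grepl(strict, x2, perl = TRUE)
res_either <- grepl(either, x2, perl = TRUE)

print(data.frame(x2, res_verb, res_strict, res_either))
```

On these inputs the strict pattern disagrees with the verb pattern for hour and soul, while the alternation agrees with it on all four.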

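I can also suggest another approach here. If you just want a boolean value indicating that a string contains ou but does not contain houl anywhere, you may use the pattern below. Note that this is a different test from your original pattern: a string such as "you should" is rejected because houl occurs somewhere in it, even though it also has a separate ou.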

stringi::stri_detect_regex(x, "^(?!.*houl).*ou")

See another regex demo

Details

  • ^ - start of the string
  • (?!.*houl) - a negative lookahead that fails the match if, right after the start of the string, there are 0+ chars other than line break chars (as many as possible) followed with houl, that is, if houl occurs anywhere on the line
  • .* - 0+ chars other than line break chars, as many as possible
  • ou - an ou substring.
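
Read this way, the pattern amounts to two plain substring checks: the string contains ou, and it does not contain houl. The sketch below compares the regex with that formulation in base R (perl = TRUE is used only so that the lookahead is available). The variable names are illustrative, and the equivalence assumes strings without line breaks.

```r
x <- c("I shouldn't", "you should", "I know", "'bout time")

# Whole-string check expressed as a single regex with a lookahead.
by_regex <- grepl("^(?!.*houl).*ou", x, perl = TRUE)

# The same idea as two fixed-string checks: has "ou", lacks "houl".
by_fixed <- grepl("ou", x, fixed = TRUE) & !grepl("houl", x, fixed = TRUE)

print(by_regex)
print(by_fixed)
print(identical(by_regex, by_fixed))
```

Both vectors should come out identical for single-line input.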

More details on Lookahead and Lookbehind Zero-Length Assertions.

Note that in ICU a lookbehind cannot contain patterns of unbounded width; limiting quantifiers inside lookbehinds are supported, however. So, in stringi, if you wanted to detect an ou that has no s earlier in the same word, you could use

> pat2 <- "(?<!s\\w{0,100})ou"
> stringi::stri_detect_regex(x, pat2)
[1] FALSE  TRUE FALSE  TRUE

Here, the constrained-width lookbehind (?<!s\\w{0,100}) fails the match if ou is preceded with s followed with 0 to 100 word characters (letters, digits, or underscores).
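
If the same check has to run in an engine without bounded lookbehinds (base R's PCRE engine may reject this lookbehind, depending on the PCRE version), one possible lookbehind-free rewrite is to anchor where a word can start and allow only non-s word characters before ou. This is a sketch under that assumption, and the name portable is made up for illustration. The stringi comparison runs only if the package is installed.

```r
x <- c("I shouldn't", "you should", "I know", "'bout time")

# (?:^|\W)  - start of string or a non-word char (where a word can start)
# [^\Ws]*   - 0+ word chars other than s
# ou        - an ou substring
portable <- "(?:^|\\W)[^\\Ws]*ou"
res_portable <- grepl(portable, x, perl = TRUE)
print(res_portable)

# Optional cross-check against the ICU lookbehind version.
if (requireNamespace("stringi", quietly = TRUE)) {
  res_icu <- stringi::stri_detect_regex(x, "(?<!s\\w{0,100})ou")
  print(identical(res_icu, res_portable))
}
```

Unlike the lookbehind version, this rewrite has no 100-character limit on the part of the word before ou.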

Wiktor Stribiżew answered Nov 15 '22 05:11