
R: workaround for variable-width lookbehind

Given this vector:

ba <- c('baa','aba','abba','abbba','aaba','aabba')

I want to change the final a of each word to i, except in baa and aba.

I wrote the following line:

gsub('(?<=a[ab]b{1,2})a', 'i', ba, perl=TRUE)

but was told: PCRE pattern compilation error 'lookbehind assertion is not fixed length' at ')a'.

I looked around a little and apparently R's Perl-style (PCRE) regex allows variable-width lookahead but not variable-width lookbehind. Is there a workaround for this? Thanks!

dasf asked Mar 27 '15 19:03


2 Answers

You can use the lookbehind alternative \K instead. This escape sequence resets the starting point of the reported match and any previously consumed characters are no longer included.

Quoted from rexegg:

The key difference between \K and a lookbehind is that in PCRE, a lookbehind does not allow you to use quantifiers: the length of what you look for must be fixed. On the other hand, \K can be dropped anywhere in a pattern, so you are free to have any quantifiers you like before \K.

Using it in context:

sub('a[ab]b{1,2}\\Ka', 'i', ba, perl=TRUE)
# [1] "baa"   "aba"   "abbi"  "abbbi" "aabi"  "aabbi"
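
To illustrate the point about quantifiers before \K, here is a minimal sketch with an unbounded b+ in the prefix, which a lookbehind could not express. The vector x is a made-up example, not data from the question.

```r
# Hypothetical test vector (an assumption, not from the question)
# with longer runs of b before the final a.
x <- c('aba', 'abba', 'abbbbbba', 'aabbbba')

# b+ is unbounded. Everything matched before \K is dropped from the
# reported match, so only the final a gets replaced.
res_k <- sub('a[ab]b+\\Ka', 'i', x, perl = TRUE)
print(res_k)
# [1] "aba"      "abbi"     "abbbbbbi" "aabbbbi"
```

One difference from a true lookbehind: the text before \K is still consumed by the match, so with gsub two replacements cannot share the same prefix characters.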

Avoiding lookarounds:

sub('(a[ab]b{1,2})a', '\\1i', ba)
# [1] "baa"   "aba"   "abbi"  "abbbi" "aabi"  "aabbi"
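
A further variant, offered only as a sketch: PCRE accepts a lookbehind whose top-level alternatives each have a fixed length, so b{1,2} can be spelled out as two branches. This is a swapped-in technique and only practical when the number of possible lengths is small.

```r
ba <- c('baa', 'aba', 'abba', 'abbba', 'aaba', 'aabba')

# Each branch of the lookbehind has its own fixed length (3 and 4
# characters), which PCRE allows at the top level of the assertion.
res_alt <- gsub('(?<=a[ab]b|a[ab]bb)a', 'i', ba, perl = TRUE)
print(res_alt)
# [1] "baa"   "aba"   "abbi"  "abbbi" "aabi"  "aabbi"
```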

hwnd answered Oct 01 '22 09:10


Another solution, which applies here only because the single quantifier inside the lookbehind is a limiting (bounded) one, is stringr::str_replace_all / stringr::str_replace:

> library(stringr)
> str_replace_all(ba, '(?<=a[ab]b{1,2})a', 'i')
[1] "baa"   "aba"   "abbi"  "abbbi" "aabi"  "aabbi"

It works because stringr regex functions are based on ICU regex that features a constrained-width lookbehind:

The length of possible strings matched by the look-behind pattern must not be unbounded (no * or + operators.)

So you cannot use just any pattern inside an ICU lookbehind, but it is good to know that a limiting quantifier such as {1,2} is allowed there, which helps when you need overlapping matches within a known distance range.
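
A small sketch contrasting the two cases, assuming the stringr package is installed. The bounded pattern is the one from above. The unbounded pattern with + is my own illustration, and I would expect ICU to reject it, so it is wrapped in tryCatch() and the script runs either way.

```r
library(stringr)

ba <- c('baa', 'aba', 'abba', 'abbba', 'aaba', 'aabba')

# Bounded quantifier {1,2} inside the lookbehind: accepted by ICU.
bounded <- str_replace(ba, '(?<=a[ab]b{1,2})a', 'i')
print(bounded)

# Unbounded quantifier + inside the lookbehind (illustrative pattern).
# If ICU raises an error, tryCatch() returns its message instead.
unbounded <- tryCatch(
  str_replace(ba, '(?<=a[ab]b+)a', 'i'),
  error = function(e) conditionMessage(e)
)
print(unbounded)
```

If the pattern has to allow an arbitrary number of repetitions, the \K or capture-group approaches from the other answer do not have this restriction.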

Wiktor Stribiżew answered Oct 01 '22 08:10