
boost::regex - \bb?

I have some badly commented legacy code here that makes use of boost::regex::perl. I was wondering about one particular construct before, but since the code worked (more or less), I was loath to touch it.

Now I have to touch it, for technical reasons (more precisely, current versions of Boost no longer accepting the construct), so I have to figure out what it does - or rather, was intended to do.

The relevant part of the regex:

(?<!(\bb\s|\bb|^[a-z]\s|^[a-z]))

The piece that gives me headaches is \bb. I know of \b, but I could not find mention of \bb, and looking for a literal 'b' would not make sense here. Is \bb some special underdocumented feature, or do I have to consider this a typo?

DevSolar asked Nov 29 '10 14:11


2 Answers

Boost.Regex is a regex library for C++, and one of its syntax options is Perl compatibility. If this is a "Perl-compatible" expression, then the second 'b' can only be a literal.

It's a valid expression: \b is a word boundary and the b after it is a literal 'b', so \bb is pretty much a special case for words beginning with 'b'.

The deciding factor seems to be that this is a C++ library whose purpose is to give non-Perl environments Perl-compatible regexes. So my original thought, that Perl itself might reinterpret the expression (say, with overload::constant), does not apply. It is still worth mentioning for clarification, however inadvisable it would be to tweak an expression meaning "word beginning with 'b'".

The only caveat is that perhaps Boost outperforms Perl at its own expressions and somebody is using the Boost engine in a Perl environment; then all bets are off as to whether it could have been meant as a special expression. As just one stab at it: given a grammar where '!!!' meant something special at the beginning of words, you could piggyback on the established meaning like this (NOT RECOMMENDED!):

s/\\bb\b/(?:!!!(\\p{Alpha})|\\bb)/

This would be a dumb thing to do, but since we are dealing with code that seems unfit for its task, keep in mind that there are thousands of ways to fail at a task.

Axeman answered Sep 18 '22 07:09


(\bb\s|\bb|^[a-z]\s|^[a-z]) matches a b if it's not preceded by another word character, or any lowercase letter if it's at the beginning of the string. In either case, the letter may be followed by a whitespace character. (It could match uppercase letters too if case-insensitive mode is set, and the ^ could also match the beginning of a line if multiline mode is set.)

But inside a lookbehind, that shouldn't even have compiled. In some flavors, a lookbehind can contain multiple alternatives with different, fixed lengths, but the alternation has to be at the top level in the lookbehind. That is, (?<=abc|xy|12345) will work, but (?<=(abc|xy|12345)) won't. So your regex wouldn't work even in those flavors, and Boost's docs say only that the lookbehind expression has to be fixed-length.

If you really need to account for all four of the possibilities matched by that regex, I suggest you split the lookbehind into two:

(?<!\bb|^[a-z])(?<!(?:\bb|^[a-z])\s)

Alan Moore answered Sep 21 '22 07:09