
Look behinds: all the rage in regex?

Many recent regex questions include some kind of look-around element that, as far as I can tell, is not necessary for the match to succeed. Is there some teaching resource that is promoting them? I am trying to figure out in which cases you would be better off using a positive look-ahead/look-behind. The main application I can see is when you are trying not to match an element. But, for example, this query from a recent question has a simple solution that just captures the .*, so why would you use a look-behind?

(?<=<td><a href="\/xxx\.html\?n=[0-9]{0, 5}">).*(?=<\/a><span

And this one from another question:

$url = "www.example.com/id/1234";
preg_match("/\d+(?<=id\/[\d])/",$url,$matches);

When is it truly better to use a positive look-around? Can you give some examples?

I realize this is bordering on an opinion-based question, but I think the answers would be really instructive. Regex is confusing enough without making things more complicated... I have read this page and am more interested in some simple guidelines for when to use them rather than how they work.


Thanks for all the replies. In addition to those below, I recommend checking out m.buettner's great answer here.

beroe asked Sep 30 '13 22:09


2 Answers

  1. You can capture overlapping matches, and you can find matches which could lie in the lookarounds of other matches.
  2. You can express complex logical assertions about your match (because many engines let you use multiple lookbehind/lookahead assertions which all must match in order for the match to succeed).
  3. Lookaround is a natural way to express the common constraint "matches X, if it is followed by/preceded by Y". It is (arguably) less natural to add extra "matching" parts that have to be thrown out by postprocessing.
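To make point 3 concrete, here is a minimal Python sketch (Python's `re` module is my choice for illustration; the URL is the one from the question, and the variable names are made up). It compares a capturing group with a lookbehind for the same "digits preceded by id/" constraint:

```python
import re

url = "www.example.com/id/1234"  # sample URL from the question

# Option A: consume the prefix as well, then dig the digits out of group 1.
via_group = re.search(r"id/(\d+)", url).group(1)

# Option B: the lookbehind asserts the prefix without consuming it,
# so the whole match is already just the digits.
via_lookbehind = re.search(r"(?<=id/)\d+", url).group(0)

# The difference shows up in substitutions: only the digits are replaced,
# with no need to re-insert the "id/" prefix by hand.
replaced = re.sub(r"(?<=id/)\d+", "42", url)

print(via_group, via_lookbehind, replaced)
```

For plain extraction the two are equivalent, which is probably why the lookbehind often looks unnecessary; it starts to pay off when the match itself is what you act on, as in the substitution.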

Negative lookaround assertions, of course, are even more useful. Combined with #2, they can allow you to do some pretty wizard tricks, which may even be hard to express in usual program logic.
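As a rough sketch of combining a positive and a negative assertion (in Python, with made-up log lines that are purely illustrative), this keeps the lines that mention "error" but do not mention "timeout":

```python
import re

# Hypothetical log data, invented for this example.
log = (
    "error: disk full\n"
    "error: timeout on db\n"
    "warning: disk full\n"
    "error: cache miss"
)

# Both assertions are evaluated at the start of each line:
#   (?=.*\berror\b)    the line must contain the word "error"
#   (?!.*\btimeout\b)  the line must NOT contain the word "timeout"
pattern = re.compile(r"^(?=.*\berror\b)(?!.*\btimeout\b).*$", re.MULTILINE)

kept = pattern.findall(log)
print(kept)
```

Without lookarounds, the "contains A but not B" condition would usually need two separate tests in the surrounding program.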


Examples, by popular request:

  • Overlapping matches: suppose you want to find all candidate genes in a given genetic sequence. Genes generally start with ATG, and end with TAG, TAA or TGA. But, candidates could overlap: false starts may exist. So, you can use a regex like this:

    ATG(?=((?:...)*(?:TAG|TAA|TGA)))
    

    This simple regex looks for the ATG start-codon, followed by some number of codons, followed by a stop codon. It pulls out everything that looks like a gene (sans start codon), and properly outputs genes even if they overlap.

  • Zero-width matching: suppose you want to find every tr with a specific class in a computer-generated HTML page. You might do something like this:

    <tr class="TableRow">.*?</tr>(?=<tr class="TableRow">|</table>)
    

    This deals with the case in which a bare </tr> appears inside the row. (Of course, in general, an HTML parser is a better choice, but sometimes you just need something quick and dirty).

  • Multiple constraints: suppose you have a file with data like id:tag1,tag2,tag3,tag4, with tags in any order, and you want to find all rows with tags "green" and "egg". This can be done easily with two lookaheads:

    (.*):(?=.*\bgreen\b)(?=.*\begg\b)
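The three patterns above can be tried out with a short Python sketch. The regexes are copied unchanged from the examples; the sample sequence, HTML fragment and tag rows are invented for illustration, and Python's `re` is just one engine that supports these constructs.

```python
import re

# 1. Overlapping matches: the lookahead captures each candidate without
#    consuming it, so an ATG inside an earlier candidate is still found.
GENE = re.compile(r"ATG(?=((?:...)*(?:TAG|TAA|TGA)))")
seq = "ATGCCATGTTTTAGGTAA"  # made-up sequence with two start codons
candidates = GENE.findall(seq)
# A consuming version swallows the second start codon.
consuming = re.findall(r"ATG(?:...)*(?:TAG|TAA|TGA)", seq)

# 2. Zero-width matching: a row only ends at a </tr> that is followed
#    by the next row or by the end of the table.
ROW = re.compile(r'<tr class="TableRow">.*?</tr>(?=<tr class="TableRow">|</table>)')
html = (
    '<table>'
    '<tr class="TableRow"><td>a stray </tr> here</td></tr>'
    '<tr class="TableRow"><td>second</td></tr>'
    '</table>'
)
rows = ROW.findall(html)

# 3. Multiple constraints: both lookaheads start right after the colon
#    and must each succeed, whatever the order of the tags.
TAGS = re.compile(r"(.*):(?=.*\bgreen\b)(?=.*\begg\b)")
data = ["1:green,egg,ham", "2:egg,blue,green", "3:green,ham", "4:eggplant,green"]
ids = []
for line in data:
    m = TAGS.match(line)
    if m:
        ids.append(m.group(1))

print(candidates)
print(consuming)
print(rows)
print(ids)
```

One thing to keep in mind with the gene pattern: `(?:...)*` is greedy, so each candidate runs to the last in-frame stop codon that can be reached; a lazy `(?:...)*?` would stop at the first one instead.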
    
nneonneo answered Sep 23 '22 07:09


There are two great things about lookaround expressions:

  • They are zero-width assertions. They must match, but they consume none of the input string. This lets you describe parts of the string that will not be included in the match result. Capturing groups inside a lookaround are also the only way to capture the same part of the input in more than one match.
  • They simplify a lot of things. While they do not extend regular languages, they make it easy to combine (intersect) multiple expressions that must all match the same part of a string.
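Both points can be sketched in a few lines of Python (the sample strings and the particular conditions are assumptions chosen for illustration):

```python
import re

# Capturing inside a lookahead: every match is zero-width, so the scan
# advances one character at a time and the captured windows overlap.
pairs = re.findall(r"(?=(\w\w))", "abcd")

# Intersection: three conditions checked against the same string.
#   (?=.*\d)      contains a digit
#   (?=.*[a-z])   contains a lowercase letter
#   .{8,}         is at least 8 characters long
CHECK = re.compile(r"^(?=.*\d)(?=.*[a-z]).{8,}$")
results = {s: bool(CHECK.search(s)) for s in ["abc12345", "abcdefgh", "a1"]}

print(pairs)
print(results)
```

In the first pattern the letters "b" and "c" each appear in two captured pairs, which a consuming pattern such as `\w\w` cannot produce in a single pass.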
Bergi answered Sep 23 '22 07:09