Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

using regex to skip ahead all characters until a specific sequence of letters is found using negative lookahead

I'm alright with basic regular expressions, but I get a bit lost around pos/neg look aheads/behinds.

I'm trying to pull the id # from this:

[keyword stuff=otherstuff id=123 morestuff=stuff]

There could be unlimited amounts of "stuff" before or after. I've been using The Regex Coach to help debug what I've tried, but I'm not moving forward anymore...

So far I have this:

\[keyword (?:id=([0-9]+))?[^\]]*\]

Which takes care of any extra attributes after the id, but I can't figure out how to ignore everything between keyword and id. I know I can't go [^id]* I believe I need to use a negative lookahead like this (?!id)* but I guess since it's zero-width, it doesn't move forward from there. This doesn't work either:

\[keyword[A-z0-9 =]*(?!id)(?:id=([0-9]+))?[^\]]*\]

I've been looking all over for examples, but haven't found any. Or perhaps I have, but they went so far over my head I didn't even realize what they were.

Help! Thanks.

EDIT: It has to match [keyword stuff=otherstuff] as well, where id= doesn't exist at all, so I have to have a 1 or 0 on the id # group. There are also other [otherkeywords id=32] which I do not want to match. The document needs to match multiple [keyword id=3] throughout the documents using preg_match_all.

like image 582
phazei Avatar asked Jul 20 '10 00:07

phazei


2 Answers

No lookahead/behind required:

/\[keyword(?:[^\]]*?\bid=([0-9]+))?[^\]]*?\]/

Added the ending '[^]]*]' to check for a real tag end, could be unnecessary.

Edit: added the \b to id as otherwise it could match [keyword you-dont-want-this-guid=123123-132123-123 id=123]

$ php -r 'preg_match_all("/\[keyword(?:[^\]]*?\bid=([0-9]+))?[^\]]*?\]/","[keyword stuff=otherstuff morestuff=stuff]",$matches);var_dump($matches);'
array(2) {
  [0]=>
  array(1) {
    [0]=>
    string(42) "[keyword stuff=otherstuff morestuff=stuff]"
  }
  [1]=>
  array(1) {
    [0]=>
    string(0) ""
  }
}
$ php -r 'var_dump(preg_match_all("/\[keyword(?:[^\]]*?\bid=([0-9]+))?[^\]]*?\]/","[keyword stuff=otherstuff id=123 morestuff=stuff]",$matches),$matches);'
int(1)
array(2) {
  [0]=>
  array(1) {
    [0]=>
    string(49) "[keyword stuff=otherstuff id=123 morestuff=stuff]"
  }
  [1]=>
  array(1) {
    [0]=>
    string(3) "123"
  }
}
like image 146
Wrikken Avatar answered Nov 17 '22 02:11

Wrikken


You do not need look ahead / behind.

Since the question is tagged PHP, use preg_match_all() and store the match in $matches.

Here's how:

<?php

  // Store the string. I single quote, in case there are backslashes I
  // didn't see.
$string = 'blah blah[keyword stuff=otherstuff id=123 morestuff=stuff]
           blah blah[otherkeyword stuff=otherstuff id=555 morestuff=stuff]
           blah blah[keyword stuff=otherstuff id=444 morestuff=stuff]';

  // The pattern is '[keyword' followed by not ']' a space and id
  // The space before id is important, so you don't catch 'guid', etc.
  // If '[keyword'  is always at the beginning of a line, you can use
  // '^\[keyword'
$pattern = '/\[keyword[^\]]* id=([0-9]+)/';

  // Find every single $pattern in $string and store it in $matches
preg_match_all($pattern, $string, $matches);

  // The only tricky part you have to know is that each entire match is stored in
  // $matches[0][x], and the part of the match in the parentheses, which is what
  // you want is stored in $matches[1][x]. The brackets are optional, since it's
  // only one line.
foreach($matches[1] as $value)
{     
    echo $value . "<br/>";
}
?>

Output:

123
444   

( 555 is skipped, as it should be)

PS

You can also use \b instead of a literal space if there could be a tab instead. \b represents a word boundary... in this case the beginning of a word.

$pattern = '/\[keyword[^\]]*\bid=([0-9]+)/';
like image 38
Peter Ajtai Avatar answered Nov 17 '22 03:11

Peter Ajtai