
Problem with whitespace in a RegEx with capture groups

I've got a regular expression that I'm trying to match against the following types of data, with each token separated by an unknown number of spaces.

Update: "Text" can be almost any character, which is why I had .* initially. Importantly, it can also include spaces.

  1. Text
  2. Text 01
  3. Text 01 of 03
  4. Text 01 (of 03)
  5. Text 01-03

I'd like to capture "Text", "01", and "03" as separate groups, and all except "Text" are optional. The best I've been able to do so far is:

\s*(.*)\s+(\d+)\s*(?:\s*\(?\s*(?:of|-)\s*(\d+)\s*\)?\s*)

This matches #3-#5, and puts them in the proper capture groups. I can't figure out, though, why when I add an additional ? to the end to make the part of the expression after 01 optional, my capture groups get all funky.

\s*(.*)\s+(\d+)\s*(?:\s*\(?\s*(?:of|-)\s*(\d+)\s*\)?\s*)?

The RegEx above matches #2-#5, but the capture groups are correct only for #2 and #5.

This seems like a straightforward regular expression, so I don't know why I'm having so much trouble with it.

This is a link to an online RegEx evaluator I'm using to help me debug this: http://regexr.com?2tb64. The link already has the first RegEx and the test data filled in.
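For reference, the behaviour described above can be reproduced outside Regexr. The sketch below assumes a JavaScript engine (the question does not name one) and uses a hypothetical helper, `groups`, to print the captures of both patterns for each sample tested as a separate string. With the trailing `?` added, the greedy `.*` is free to swallow "01 of", because the rest of the pattern no longer has to match anything after the last number.

```javascript
// Patterns copied from the question; JavaScript is an assumed engine.
const required = /\s*(.*)\s+(\d+)\s*(?:\s*\(?\s*(?:of|-)\s*(\d+)\s*\)?\s*)/;
const optional = /\s*(.*)\s+(\d+)\s*(?:\s*\(?\s*(?:of|-)\s*(\d+)\s*\)?\s*)?/;

// Hypothetical helper: return the three capture groups,
// or null when the pattern does not match at all.
function groups(re, s) {
  const m = re.exec(s);
  return m ? [m[1], m[2], m[3]] : null;
}

const samples = [
  "Text",
  "Text 01",
  "Text 01 of 03",
  "Text 01 (of 03)",
  "Text 01-03",
];

// Print the captures of both patterns side by side for each sample.
for (const s of samples) {
  console.log(JSON.stringify(s), groups(required, s), groups(optional, s));
}
```

Under these assumptions, the optional version captures "Text 01 of" and "03" for sample #3, and "Text 01 (of" and "03" for sample #4, which is the misplacement described in the question.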

Dov asked Mar 18 '11 22:03



3 Answers

You didn't say which regex tool you are using, so I am assuming the least common denominator, i.e. JavaScript. Here is one that works:

var re = /^\s*(.+?)(?:\s+(\d+)(?:(?:\s+\(?of\s+|-)(\d+)\)?)?)?$/i;

To make this work in your Regexr tool, be sure to turn on the "multiline" option, so that ^ and $ anchor at the start and end of each line of the test data.

Here is the same thing in PHP syntax (with lots of juicy comments!):

$re = '/ # Always write non-trivial regex in free-space mode!
    ^                  # Anchor to start of string.
    \s*                # Optional leading whitespace is ok.
    (.+?)              # Text can be pretty much anything.
    (?:                # Group to allow applying ? quantifier
      \s+              # WS separates "Text" from first number.
      (\d+)            # First number.
      (?:              # Group to allow applying ? quantifier
        (?:            # Second number prefix alternatives
          \s+\(?of\s+  # Either " of 03" or " (of 03)",
        | -            # or just a dash for the "-03" case.
        )              # End second number prefix alternatives
        (\d+)          # Second number
        \)?            # Match ")" for " (of 03)" case.
      )?               # Second number is optional.
    )?                 # First number is optional.
    $                  # Anchor to end of string.
    /ix';
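
As a quick check, the JavaScript pattern above can be run against the five sample inputs from the question. This is only a sketch: the helper name `parseTitle` and the `text`/`first`/`second` field names are hypothetical, chosen for illustration. The lazy `(.+?)` together with the `^` and `$` anchors is what keeps the numbers out of the first group.

```javascript
// Pattern copied from the answer above.
const re = /^\s*(.+?)(?:\s+(\d+)(?:(?:\s+\(?of\s+|-)(\d+)\)?)?)?$/i;

// Hypothetical helper: split one line into its text and up to two numbers.
// Groups that did not take part in the match are reported as null.
function parseTitle(line) {
  const m = re.exec(line);
  if (m === null) return null;
  return { text: m[1], first: m[2] ?? null, second: m[3] ?? null };
}

const samples = [
  "Text",
  "Text 01",
  "Text 01 of 03",
  "Text 01 (of 03)",
  "Text 01-03",
  "Some Text   01   of   03", // text containing a space, extra whitespace
];

for (const s of samples) {
  console.log(JSON.stringify(s), parseTitle(s));
}
```

Each sample is passed as its own string here, so the multiline flag is not needed; it only matters when the whole test data is matched as one block, as in Regexr.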
ridgerunner answered Nov 23 '22 04:11



Try this:
http://regexr.com?2tb67

Regex looks something like:

(\w+?)\s+(\d*)[^\d]*(\d+)

Match the word characters, followed by whitespace, then an optional run of digits, followed by anything that's not a digit, then match the remaining digits.

Note that the second result probably isn't ideal for you because 01 comes in the third group match. But it matches all your cases.
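
One way to check the group placement is to run the pattern on each sample as a separate string. The sketch below assumes JavaScript and a hypothetical helper named `groups`. Results in Regexr can differ, because there the samples form a single multi-line text and both `\s+` and `[^\d]*` can match across line breaks.

```javascript
// Pattern copied from this answer; JavaScript is an assumed engine.
const re = /(\w+?)\s+(\d*)[^\d]*(\d+)/;

// Hypothetical helper: return the three capture groups,
// or null when the pattern does not match.
function groups(s) {
  const m = re.exec(s);
  return m ? [m[1], m[2], m[3]] : null;
}

const samples = [
  "Text",
  "Text 01",
  "Text 01 of 03",
  "Text 01 (of 03)",
  "Text 01-03",
];

for (const s of samples) {
  console.log(JSON.stringify(s), groups(s));
}
```

Run this way, "Text" on its own does not match, since the pattern requires at least one digit, and "Text 01" is split into "0" and "1" because `(\d*)` gives back a digit so that `(\d+)` can succeed. Samples #3 to #5 come out as "Text", "01", "03".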

Joe answered Nov 23 '22 05:11



Your second one is close.

So I reworked it in regexr; everything now matches in the correct groups:

\s*(\w*)\s+(?:\s*(\d+)\s*(?:\s*\(?\s*(?:of|-)\s*(\d+)\s*\)?)?)?
stema answered Nov 23 '22 05:11
