Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Regex: Repeated capturing groups

Tags:

c#

.net

regex

I have to parse some tables from an ASCII text file. Here's a partial sample:

QSMDRYCELL   11.00   11.10   11.00   11.00    -.90      11     11000     1.212
RECKITTBEN  192.50  209.00  192.50  201.80    5.21      34      2850     5.707
RUPALIINS   150.00  159.00  150.00  156.25    6.29       4        80      .125
SALAMCRST   164.00  164.75  163.00  163.25    -.45      80      8250    13.505
SINGERBD    779.75  779.75  770.00  773.00    -.89       8        95      .735
SONARBAINS   68.00   69.00   67.50   68.00     .74      11      3050     2.077

The table consists of 1 column of text and 8 columns of floating point numbers. I'd like to capture each column via regex.

I'm pretty new to regular expressions. Here's the faulty regex pattern I came up with:

(\S+)\s+(\s+[\d\.\-]+){8}

But the pattern captures only the first and the last columns. RegexBuddy also emits the following warning:

You repeated the capturing group itself. The group will capture only the last iteration. Put a capturing group around the repeated group to capture all iterations.

I've consulted their help file, but I don't have a clue as to how to solve this.

How can I capture each column separately?

like image 633
invarbrass Avatar asked Jul 03 '10 19:07

invarbrass


2 Answers

In C# (modified from this example):

string input = "QSMDRYCELL   11.00   11.10   11.00   11.00    -.90      11     11000     1.212";
string pattern = @"^(\S+)\s+(\s+[\d.-]+){8}$";
Match match = Regex.Match(input, pattern, RegexOptions.MultiLine);
if (match.Success) {
   Console.WriteLine("Matched text: {0}", match.Value);
   for (int ctr = 1; ctr < match.Groups.Count; ctr++) {
      Console.WriteLine("   Group {0}:  {1}", ctr, match.Groups[ctr].Value);
      int captureCtr = 0;
      foreach (Capture capture in match.Groups[ctr].Captures) {
         Console.WriteLine("      Capture {0}: {1}", 
                           captureCtr, capture.Value);
         captureCtr++; 
      }
   }
}

Output:

Matched text: QSMDRYCELL   11.00   11.10   11.00   11.00    -.90      11     11000     1.212
...
    Group 2:      1.212
         Capture 0:  11.00
         Capture 1:    11.10
         Capture 2:    11.00
...etc.
like image 120
Tim Pietzcker Avatar answered Oct 14 '22 21:10

Tim Pietzcker


Unfortunately you need to repeat the (…) 8 times to get each column separately.

^(\S+)\s+([-.\d]+)\s+([-.\d]+)\s+([-.\d]+)\s+([-.\d]+)\s+([-.\d]+)\s+([-.\d]+)\s+([-.\d]+)\s+([-.\d]+)$

If code is possible, you can first match those numeric columns as a whole

>>> rx1 = re.compile(r'^(\S+)\s+((?:[-.\d]+\s+){7}[-.\d]+)$', re.M)
>>> allres = rx1.findall(theAsciiText)

then split the columns by spaces

>>> [[p] + q.split() for p, q in allres]
like image 26
kennytm Avatar answered Oct 14 '22 21:10

kennytm