How to match both numbers and range of numbers in a CSV-like string with regex?

Tags: c#, .net, regex

I usually enjoy the challenge of regular expressions, and enjoy solving them even more.
But I seem to have a case that I can't figure out.

I have a string of semicolon-separated values, like a CSV line, that can look like this: 123-234;FOO-456;45-67;FOO-FOO;890;FOO-123;11-22;123;123;44-55;098-567;890;123-FOO;

In this line I would like to match all integers and integer ranges in order to extract them later. The string may also contain only a single value (with no semicolon).

After a lot of searching I managed to write this expression:
(?:^|;)(?<range>\d+-\d+)(?:$|;)|(?:^|;)(?<integer>\d+)(?:$|;)

The test strings I'm using:

  1. 123
  2. 123-234;FOO-456;45-67;FOO-FOO;890;FOO-123;11-22;123;123;44-55;098-567;890;123-FOO;
  3. 123-456
  4. 123-FOO
  5. FOO-123
  6. FOO-FOO

Lines 1 and 3 are correctly matched, and lines 4, 5 and 6 are not matched, as expected.
In line 2, only one valid value out of two is matched.

Here is a link to regex101.com that illustrates it: https://regex101.com/r/zA7uI9/5

I would also need to select integers and the ranges separately (in different groups).

Note: I found a question that could help me and tried its answer (by adapting it) but it didn't work.
Regular expression for matching numbers and ranges of numbers

Do you have any idea what I'm missing?

The language that will use this regex is C#, but I don't know whether that is useful information for my problem.

added by barlop

Here are the matches the current regex gives, as shown by that regex101.com link, for this test string: 123-234;FOO-456;45-67;FOO-FOO;890;FOO-123;11-22;123;123;44-55;098-567;89

123-234
45-67
890
11-22
123
098-567

So the regex misses one of the 123s, the 44-55, and the 89 at the end.

asked May 31 '16 21:05 by Niitaku


2 Answers

C# CSV String Parsing

Use the built-in CSV parser and check each field separately:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;
using Microsoft.VisualBasic.FileIO;

var str = "123-234;FOO-456;45-67;FOO-FOO;890;FOO-123;11-22;123;123;44-55;098-567;890;123-FOO;";
var csv_parser = new TextFieldParser(new StringReader(str));
csv_parser.HasFieldsEnclosedInQuotes = false;   // Fields are not enclosed with quotes
csv_parser.SetDelimiters(";");                  // Setting delimiter
string[] fields;
var range_fields = new List<string>();
var integer_fields = new List<string>();
while (!csv_parser.EndOfData)
{
    fields = csv_parser.ReadFields();
    foreach (var field in fields)
    {
        // A field made up only of digits is a single integer
        if (!string.IsNullOrWhiteSpace(field) && field.All(x => Char.IsDigit(x)))
        {
            integer_fields.Add(field);
            Console.WriteLine(string.Format("Integer field: {0}", field));
        }
        // Anchored so that the whole field must be digits, a hyphen, then digits
        else if (!string.IsNullOrWhiteSpace(field) && Regex.IsMatch(field, @"^\d+-\d+$"))
        {
            range_fields.Add(field);
            Console.WriteLine(string.Format("Range field: {0}", field));
        }
    }
}
csv_parser.Close();

The result is:

Range field: 123-234
Range field: 45-67
Integer field: 890
Range field: 11-22
Integer field: 123
Integer field: 123
Range field: 44-55
Range field: 098-567
Integer field: 890

Fixing Regex Approach

Your regex fails because the non-capturing groups still consume the delimiters: (?:^|;) and (?:$|;) match text, that text becomes part of the match value, and the regex index advances to the position after the ; (or the start/end of string). A ; consumed as the trailing delimiter of one match is no longer available as the leading delimiter of the next, so the value right after each match is skipped.
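The consumed delimiters can be seen by printing each raw match value. The following is a minimal sketch, assuming C# 9 or later (top-level statements); RawMatches is a hypothetical helper name, and the pattern is the one from the question, unchanged.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// The pattern from the question, unchanged.
const string Original =
    @"(?:^|;)(?<range>\d+-\d+)(?:$|;)|(?:^|;)(?<integer>\d+)(?:$|;)";

// Returns each full match value, including any ';' the match consumed.
List<string> RawMatches(string input)
{
    var values = new List<string>();
    foreach (Match m in Regex.Matches(input, Original))
        values.Add(m.Value);
    return values;
}

// Brackets make the consumed semicolons visible in the output.
foreach (var v in RawMatches("11-22;123;123;44-55"))
    Console.WriteLine("[" + v + "]");
```

For this shortened input the sketch should print [11-22;] and [;123;]: the first 123 and the 44-55 are skipped because the semicolon in front of each was already consumed by the preceding match.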

What you need are lookarounds. They do not consume text; they only check whether text matching the lookaround pattern can or cannot be found before or after the current position. This lets adjacent matches share the same delimiter, which is one of the scenarios where lookarounds are very handy.

(?<=^|;)((?<range>\d+-\d+)|(?<integer>\d+))(?=$|;)

See the regex demo at RegexStorm, which supports .NET regex syntax.

And a nice-to-have diagram:

[Image: diagram of the regex above]

Note the use of the RegexOptions.ExplicitCapture flag: this way, we avoid getting the submatches captured with the numbered (i.e. unnamed) capturing groups and only get the named captures (just what we need).
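The effect of the flag can be checked by listing the group names the regex exposes with and without it. This is only an illustrative sketch, assuming C# 9 or later; GroupNames is a hypothetical helper, while Regex.GetGroupNames is the standard .NET method.

```csharp
using System;
using System.Text.RegularExpressions;

const string Pattern = @"(?<=^|;)((?<range>\d+-\d+)|(?<integer>\d+))(?=$|;)";

// Returns the group names the compiled regex exposes under the given options.
string[] GroupNames(RegexOptions options) =>
    new Regex(Pattern, options).GetGroupNames();

// Without the flag, the unnamed parentheses create a numbered group.
Console.WriteLine(string.Join(", ", GroupNames(RegexOptions.None)));
// With the flag, only group 0 and the two named groups remain.
Console.WriteLine(string.Join(", ", GroupNames(RegexOptions.ExplicitCapture)));
```

With RegexOptions.None the regex should expose four groups (the whole match, the unnamed group, range and integer); with RegexOptions.ExplicitCapture the unnamed group is gone and three remain.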

C# demo:

using System;
using System.Linq;
using System.Text.RegularExpressions;

var s = "123-234;FOO-456;45-67;FOO-FOO;890;FOO-123;11-22;123;123;44-55;098-567;890;123-FOO;";
var rx = new Regex(@"(?<=^|;)((?<range>\d+-\d+)|(?<integer>\d+))(?=$|;)", RegexOptions.ExplicitCapture);
// For each match, take the value of whichever named group succeeded
var result = rx.Matches(s)
        .Cast<Match>()
        .Select(x => x.Groups["range"].Success ?
            x.Groups["range"].Value : x.Groups["integer"].Value
        ).ToList();
foreach (var x in result)
    Console.WriteLine(x);
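Since the question asks for integers and ranges in separate groups, the same pattern can also feed two lists instead of one. The following is a sketch under assumptions rather than a definitive implementation: it assumes C# 9 or later, and Extract is a hypothetical helper name.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Same lookaround pattern; with ExplicitCapture only the named groups capture.
var rx = new Regex(@"(?<=^|;)((?<range>\d+-\d+)|(?<integer>\d+))(?=$|;)",
                   RegexOptions.ExplicitCapture);

// Splits the matches into two lists by checking which named group succeeded.
(List<string> Ranges, List<string> Integers) Extract(string input)
{
    var ranges = new List<string>();
    var integers = new List<string>();
    foreach (Match m in rx.Matches(input))
    {
        if (m.Groups["range"].Success)
            ranges.Add(m.Groups["range"].Value);
        else
            integers.Add(m.Groups["integer"].Value);
    }
    return (ranges, integers);
}

var line = "123-234;FOO-456;45-67;FOO-FOO;890;FOO-123;11-22;123;123;44-55;098-567;890;123-FOO;";
var (rangeList, integerList) = Extract(line);
Console.WriteLine("Ranges:   " + string.Join(", ", rangeList));
Console.WriteLine("Integers: " + string.Join(", ", integerList));
```

The values stay as strings here so that leading zeros such as 098 are preserved; converting them to numbers is left to the caller.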
answered Sep 28 '22 01:09 by Wiktor Stribiżew



I can't easily see capture groups in regex101, so that part may need some tweaking, but this regex gets all the matches right. Hopefully somebody will post an improved answer, but in the meantime, here it is:

(^\d+(?=;|$))|((?<=;)\d+$)|(?<=;)\d+(?=;)|\d+-\d+

Diagram added by ro yo:

[Image: regular expression visualization]

The logic is,

Match if (^\d+(?=;|$)) OR ((?<=;)\d+$) OR (?<=;)\d+(?=;) OR \d+-\d+

That is, it matches a number such as 123 at the beginning (or alone), at the end, or in the middle, or a range anywhere.

I can't quite get regex101.com to list the matches, but the regex works:

C:\blah>echo 123-234;FOO-456;45-67;FOO-FOO;890;FOO-123;11-22;123;123;44-55;098-567;89| grep -oP "(^\d+(?=;))|((?<=;)\d+$)|(?<=;)\d+(?=;)|\d+-\d+"

123-234
45-67
890
11-22
123
123
44-55
098-567
89
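Because the question targets C#, the same alternation can also be checked with the .NET regex engine. This is a minimal sketch, assuming C# 9 or later; MatchAll is a hypothetical helper name, and the pattern is the one given at the top of this answer.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// The four-way alternation from this answer, unchanged.
const string Alternation = @"(^\d+(?=;|$))|((?<=;)\d+$)|(?<=;)\d+(?=;)|\d+-\d+";

// Collects every match value in order of appearance.
List<string> MatchAll(string input)
{
    var values = new List<string>();
    foreach (Match m in Regex.Matches(input, Alternation))
        values.Add(m.Value);
    return values;
}

var test = "123-234;FOO-456;45-67;FOO-FOO;890;FOO-123;11-22;123;123;44-55;098-567;89";
Console.WriteLine(string.Join(" ", MatchAll(test)));

// The range branch has no delimiter check, so part of a malformed field still matches.
Console.WriteLine(string.Join(" ", MatchAll("1-2-3")));
```

One thing to keep in mind is that the \d+-\d+ branch is not tied to the semicolons, so a field like 1-2-3 still yields 1-2. If such fields can occur in the data, wrapping the range in the same lookarounds as the integers would be one way to exclude them.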
answered Sep 28 '22 00:09 by barlop