I usually enjoy the challenge of regular expressions and, even better, solving them.
But I seem to have a case that I can't figure out.
I have a string of values separated by semicolons, like a CSV line, which can look like this:
123-234;FOO-456;45-67;FOO-FOO;890;FOO-123;11-22;123;123;44-55;098-567;890;123-FOO;
In this line I would like to match all integers and integer ranges in order to extract them later. The line may also contain only a single value (no semicolon).
After a lot of searching I managed to write this expression:
(?:^|;)(?<range>\d+-\d+)(?:$|;)|(?:^|;)(?<integer>\d+)(?:$|;)
The test strings I'm using:
123
123-234;FOO-456;45-67;FOO-FOO;890;FOO-123;11-22;123;123;44-55;098-567;890;123-FOO;
123-456
123-FOO
FOO-123
FOO-FOO
Lines 1 and 3 are correctly matched, and lines 4, 5 and 6 are correctly not matched.
In line 2, only every other valid value is matched.
Here is a link to regex101.com that illustrates it: https://regex101.com/r/zA7uI9/5
I would also need to select integers and the ranges separately (in different groups).
Note: I found a question that looked helpful, "Regular expression for matching numbers and ranges of numbers", and tried adapting its answer, but it didn't work.
Any idea what I'm missing?
The language that will use this regex is C#, but I don't know whether that is relevant to my problem.
(Added by barlop.)
Here are the matches the current regex gives, as shown by that regex101.com link, for this test string:
123-234;FOO-456;45-67;FOO-FOO;890;FOO-123;11-22;123;123;44-55;098-567;89
123-234
45-67
890
11-22
123
098-567
So the regex is missing one of the 123s, the 44-55, and the 89 at the end.
Use the built-in CSV parser and check each field separately:
    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;
    using System.Text.RegularExpressions;
    using Microsoft.VisualBasic.FileIO;
    ....
    var str = "123-234;FOO-456;45-67;FOO-FOO;890;FOO-123;11-22;123;123;44-55;098-567;890;123-FOO;";
    var csv_parser = new TextFieldParser(new StringReader(str));
    csv_parser.HasFieldsEnclosedInQuotes = false; // Fields are not enclosed with quotes
    csv_parser.SetDelimiters(";"); // Setting delimiter
    string[] fields;
    var range_fields = new List<string>();
    var integer_fields = new List<string>();
    while (!csv_parser.EndOfData)
    {
        fields = csv_parser.ReadFields();
        foreach (var field in fields)
        {
            // A field made up only of digits is a plain integer
            if (!string.IsNullOrWhiteSpace(field) && field.All(x => Char.IsDigit(x)))
            {
                integer_fields.Add(field);
                Console.WriteLine(string.Format("Integer field: {0}", field));
            }
            // Anchor the pattern so the whole field must be digits-dash-digits
            else if (!string.IsNullOrWhiteSpace(field) && Regex.IsMatch(field, @"^\d+-\d+$"))
            {
                range_fields.Add(field);
                Console.WriteLine(string.Format("Range field: {0}", field));
            }
        }
    }
    csv_parser.Close();
The result is:
    Range field: 123-234
    Range field: 45-67
    Integer field: 890
    Range field: 11-22
    Integer field: 123
    Integer field: 123
    Range field: 44-55
    Range field: 098-567
    Integer field: 890
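If referencing Microsoft.VisualBasic is not an option, the same "check each field separately" idea can be sketched with string.Split. This swaps TextFieldParser for a plain split, so it assumes fields never contain quoted or escaped semicolons. The names SplitDemo and Classify are made up for illustration, and converting the values to int (so 098 becomes 98) is an extra assumption about what "extract them later" means:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SplitDemo
{
    // [0-9] rather than \d so that int.Parse only ever sees ASCII digits.
    static readonly Regex IntegerRx = new Regex(@"^[0-9]+$");
    static readonly Regex RangeRx = new Regex(@"^([0-9]+)-([0-9]+)$");

    // Hypothetical helper: sorts each field into integers or ranges, ignoring anything else.
    public static void Classify(string line, List<int> integers, List<(int Start, int End)> ranges)
    {
        foreach (var field in line.Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries))
        {
            var m = RangeRx.Match(field);
            if (m.Success)
                ranges.Add((int.Parse(m.Groups[1].Value), int.Parse(m.Groups[2].Value)));
            else if (IntegerRx.IsMatch(field))
                integers.Add(int.Parse(field));
        }
    }

    public static void Main()
    {
        var integers = new List<int>();
        var ranges = new List<(int Start, int End)>();
        Classify("123-234;FOO-456;45-67;FOO-FOO;890;FOO-123;11-22;123;123;44-55;098-567;890;123-FOO;",
                 integers, ranges);

        Console.WriteLine("Integers: " + string.Join(", ", integers));
        foreach (var r in ranges)
            Console.WriteLine("Range: " + r.Start + " to " + r.End);
    }
}
```

Values too large for int would make int.Parse throw, so long.Parse or keeping the strings may be safer depending on the data.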
The reason for your regex failure is that you actually consume the delimiters with the non-capturing groups: (?:^|;) and (?:$|;) still match text, that text is added to the match value, and the regex index is advanced to the position after the ; (or the start/end of the string).
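To see the consumed delimiters directly, here is a small sketch that prints the full text of each match produced by the pattern from the question on a shortened input. The class and method names (ConsumeDemo, RawMatches) are made up for this illustration:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class ConsumeDemo
{
    // Pattern from the question: the delimiters are part of each match.
    public const string Consuming =
        @"(?:^|;)(?<range>\d+-\d+)(?:$|;)|(?:^|;)(?<integer>\d+)(?:$|;)";

    // Returns the full text of every match, delimiters included.
    public static string[] RawMatches(string pattern, string input) =>
        Regex.Matches(input, pattern).Cast<Match>().Select(m => m.Value).ToArray();

    public static void Main()
    {
        // Prints [890;] and [;123;].
        // "890;" takes the semicolon the first 123 would need in front of it,
        // and ";123;" takes the one that 44-55 would need.
        foreach (var m in RawMatches(Consuming, "890;123;123;44-55"))
            Console.WriteLine("[" + m + "]");
    }
}
```

Each match value includes its semicolons, and every value whose leading semicolon was already taken by the previous match is skipped.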
What you need to use is lookarounds. They do not consume text, they just check if some text matching the lookaround pattern can or cannot be found before or after the current position. Thus, you get a chance to obtain overlapping matches, and it is one of the scenarios when the lookarounds are very handy.
(?<=^|;)((?<range>\d+-\d+)|(?<integer>\d+))(?=$|;)
See the regex demo at RegexStorm, a tester that supports the .NET regex syntax.
Note the use of the RegexOptions.ExplicitCapture flag: this way, we avoid getting the submatches captured with the numbered (i.e. unnamed) capturing groups and only get the named captures (just what we need).
C# demo:
    using System;
    using System.Linq;
    using System.Text.RegularExpressions;

    var s = "123-234;FOO-456;45-67;FOO-FOO;890;FOO-123;11-22;123;123;44-55;098-567;890;123-FOO;";
    var rx = new Regex(@"(?<=^|;)((?<range>\d+-\d+)|(?<integer>\d+))(?=$|;)", RegexOptions.ExplicitCapture);
    // Take whichever named group took part in each match
    var result = rx.Matches(s)
        .Cast<Match>()
        .Select(x => x.Groups["range"].Success ?
            x.Groups["range"].Value : x.Groups["integer"].Value
        ).ToList();
    foreach (var x in result)
        Console.WriteLine(x);
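Since the question also asks for integers and ranges to be selected separately, the same regex can fill two lists by checking which named group succeeded. This is a minimal sketch; SeparateGroupsDemo and Extract are illustrative names, not part of any library:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SeparateGroupsDemo
{
    static readonly Regex Rx = new Regex(
        @"(?<=^|;)((?<range>\d+-\d+)|(?<integer>\d+))(?=$|;)",
        RegexOptions.ExplicitCapture);

    // Hypothetical helper: routes each match to a list based on the named group that matched.
    public static (List<string> Ranges, List<string> Integers) Extract(string input)
    {
        var ranges = new List<string>();
        var integers = new List<string>();
        foreach (Match m in Rx.Matches(input))
        {
            if (m.Groups["range"].Success)
                ranges.Add(m.Groups["range"].Value);
            else
                integers.Add(m.Groups["integer"].Value);
        }
        return (ranges, integers);
    }

    public static void Main()
    {
        var s = "123-234;FOO-456;45-67;FOO-FOO;890;FOO-123;11-22;123;123;44-55;098-567;890;123-FOO;";
        var (ranges, integers) = Extract(s);
        Console.WriteLine("Ranges:   " + string.Join(", ", ranges));
        Console.WriteLine("Integers: " + string.Join(", ", integers));
    }
}
```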
I can't easily see capture groups in regex101, so that part may need some tweaking, but this gets all the matches correct, and it captures. Hopefully somebody will post an improved answer, but in the meantime, here is mine:
(^\d+(?=;|$))|((?<=;)\d+$)|(?<=;)\d+(?=;)|\d+-\d+
The logic is,
Match if (^\d+(?=;|$))
OR ((?<=;)\d+$)
OR (?<=;)\d+(?=;)
OR \d+-\d+
That is, for example, a 123 at the beginning (or alone), a 123 at the end, a 123 in the middle, or a range anywhere.
I can't quite get regex101.com to list the matches, but the regex works with grep:
C:\blah>echo 123-234;FOO-456;45-67;FOO-FOO;890;FOO-123;11-22;123;123;44-55;098-567;89| grep -oP "(^\d+(?=;))|((?<=;)\d+$)|(?<=;)\d+(?=;)|\d+-\d+"
123-234
45-67
890
11-22
123
123
44-55
098-567
89
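Because the question targets C#, here is a rough sketch that runs the same alternation through .NET's Regex class. The names AlternationDemo and Find are invented for this example, and telling ranges from integers by looking for a dash in the match is a shortcut chosen here, not something the pattern provides. Note that the last alternative, \d+-\d+, is not tied to a delimiter, so a field such as FOO12-34 still yields 12-34:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class AlternationDemo
{
    public const string Pattern =
        @"(^\d+(?=;|$))|((?<=;)\d+$)|(?<=;)\d+(?=;)|\d+-\d+";

    // Returns the text of every match in order.
    public static string[] Find(string input) =>
        Regex.Matches(input, Pattern).Cast<Match>().Select(m => m.Value).ToArray();

    public static void Main()
    {
        var line = "123-234;FOO-456;45-67;FOO-FOO;890;FOO-123;11-22;123;123;44-55;098-567;89";
        foreach (var value in Find(line))
        {
            // A dash in the matched text means the range alternative matched.
            var kind = value.IndexOf('-') >= 0 ? "range:   " : "integer: ";
            Console.WriteLine(kind + value);
        }

        // Caveat: the unanchored range alternative also fires inside a field.
        Console.WriteLine(string.Join(",", Find("FOO12-34")));
    }
}
```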