I've written a regex whose job is to return all matches to its three alternative capture groups. My goal is to learn which capture group produced each match. PCRE seems able to produce that information. But I haven't yet been able coerce the TRegEx
class in Delphi XE8 to yield meaningful capture group info for matches. I can't claim to be at the head of regex class, and TRegEx
is new to me, so who knows what errors I'm making.
The regex (regex101.com workpad) is:
(?'word'\b[a-zA-Z]{3,}\b)|(?'id'\b\d{1,3}\b)|(?'course'\b[BL]\d{3}\b)
This test text:
externship L763 clinic 207 B706 b512
gives five matches in test environments. But a simple test program that walks the TGroupCollection
of each TMatch
in the TMatchCollection
shows bizarre results about groups: all matches have more than one group (2, 3 or 4) with each group's Success
true, and often the matched text is duplicated in several groups or is empty. So this data structure (below) isn't what I'd expect:
Using TRegEx
Regex: (?'word'\b[a-zA-Z]{3,}\b)|(?'id'\b\d{1,3}\b)|(?'course'\b[BL]\d{3}\b)
Text: externship L763 clinic 207 B706 b512
5 matches
'externship' with 2 groups:
length 10 at 1 value 'externship' (Sucess? True)
length 10 at 1 value 'externship' (Sucess? True)
'L763' with 4 groups:
length 4 at 12 value 'L763' (Sucess? True)
length 0 at 1 value '' (Sucess? True)
length 0 at 1 value '' (Sucess? True)
length 4 at 12 value 'L763' (Sucess? True)
'clinic' with 2 groups:
length 6 at 17 value 'clinic' (Sucess? True)
length 6 at 17 value 'clinic' (Sucess? True)
'207' with 3 groups:
length 3 at 24 value '207' (Sucess? True)
length 0 at 1 value '' (Sucess? True)
length 3 at 24 value '207' (Sucess? True)
'B706' with 4 groups:
length 4 at 28 value 'B706' (Sucess? True)
length 0 at 1 value '' (Sucess? True)
length 0 at 1 value '' (Sucess? True)
length 4 at 28 value 'B706' (Sucess? True)
My simple test runner is this:
program regex_tester;
{$APPTYPE CONSOLE}
{$R *.res}
uses
System.SysUtils,
System.RegularExpressions,
System.RegularExpressionsCore;
var
Matched : Boolean;
J : integer;
Group : TGroup;
Match : TMatch;
Matches : TMatchCollection;
RegexText,
TestText : String;
RX : TRegEx;
RXPerl : TPerlRegEx;
begin
try
RegexText:='(?''word''\b[a-zA-Z]{3,}\b)|(?''id''\b\d{1,3}\b)|(?''course''\b[BL]\d{3}\b)';
TestText:='externship L763 clinic 207 B706 b512';
RX:=TRegex.Create(RegexText);
Matches:=RX.Matches(TestText);
Writeln(Format(#10#13#10#13'Using TRegEx'#10#13'Regex: %s'#10#13'Text: %s'#10#13,[RegexText, TestText]));
Writeln(Format('%d matches', [Matches.Count]));
for Match in Matches do
begin
Writeln(Format(' ''%s'' with %d groups:', [Match.Value,Match.Groups.Count]));
for Group in Match.Groups do
Writeln(Format(#9'length %d at %d value ''%s'' (Sucess? %s)', [Group.Length,Group.Index,Group.Value,BoolToStr(Group.Success, True)]));
end;
RXPerl:=TPerlRegEx.Create;
RXPerl.Subject:=TestText;
RXPerl.RegEx:=RegexText;
Writeln(Format(#10#13#10#13'Using TPerlRegEx'#10#13'Regex: %s'#10#13'Text: %s'#10#13,[RXPerl.Regex, RXPerl.Subject]));
Matched:=RXPerl.Match;
if Matched then
repeat
begin
Writeln(Format(' ''%s'' with %d groups:', [RXPerl.MatchedText,RXPerl.GroupCount]));
for J:=1 to RXPerl.GroupCount do
Writeln(Format(#9'length %d at %d, value ''%s''',[RXPerl.GroupLengths[J],RXPerl.GroupOffsets[J],RXPerl.Groups[J]]));
Matched:=RXPerl.MatchAgain;
end;
until Matched=false;
except
on E: Exception do
Writeln(E.ClassName, ': ', E.Message);
end;
end.
I'd certainly appreciate a nudge in the right direction. If TRegEx
is broken, I can of course use an alternative -- or I can let go of the perceived elegance of the solution, instead using three simpler tests to find the bits of info I need.
As @andrei-galatyn notes, TRegEx
uses TPerlRegEx
for its work. So I added a section to my testing program (output below) where I experiment with that, too. It isn't as convenient to use as TRegEx
, but its result is what it should be -- and without the problems of TRegEx's broken TGroup
data structures. Whichever class I use, the last group's index (less 1 for TRegEx) is the capturing group I want.
Along the way I was reminded that Pascal arrays are often based on 1 rather than 0.
Using TPerlRegEx
Regex: (?'word'\b[a-zA-Z]{3,}\b)|(?'id'\b\d{1,3}\b)|(?'course'\b[BL]\d{3}\b)
Text: externship L763 clinic 207 B706 b512
'externship' with 1 groups:
length 10 at 1, value 'externship'
'L763' with 3 groups:
length 0 at 1, value ''
length 0 at 1, value ''
length 4 at 12, value 'L763'
'clinic' with 1 groups:
length 6 at 17, value 'clinic'
'207' with 2 groups:
length 0 at 1, value ''
length 3 at 24, value '207'
'B706' with 3 groups:
length 0 at 1, value ''
length 0 at 1, value ''
length 4 at 28, value 'B706'
If your regular expression has named capturing groups, then you should use named backreferences to them in the replacement text. The regex (?' name'group) has one group called “name”. You can reference this group with ${name} in the JGsoft applications, Delphi, .
Normally, within a pattern, you create a back-reference to the content a capture group previously matched by using a backslash followed by the group number—for instance \1 for Group 1. (The syntax for replacements can vary.)
Capturing groups are a way to treat multiple characters as a single unit. They are created by placing the characters to be grouped inside a set of parentheses. For example, the regular expression (dog) creates a single group containing the letters "d" "o" and "g" .
Internally Delphi uses class TPerlRegEx and it has such description for GroupCount property:
Number of matched groups stored in the Groups array. This number is the number of the highest-numbered capturing group in your regular expression that actually participated in the last match. It may be less than the number of capturing groups in your regular expression.
E.g. when the regex "(a)|(b)" matches "a", GroupCount will be 1. When the same regex matches "b", GroupCount will be 2.
TRegEx class always adds one more group (for whole expression i guess). In your case it should be enough to check every match like this:
case Match.Groups.Count-1 of
1: ; // "word" found
2: ; // "id" found
3: ; // "course" found
end;
It doesn't answer why Groups are filled with strange data, indeed it seems to be enough to answer your question. :)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With