I need regex help to create a Delphi function to replace the HyperString ParseWord function in RAD Studio XE2. HyperString was a very useful string library that never made the jump to Unicode. I've got it mostly working, but it doesn't honor quote delimiters at all. I need it to be an exact match for the function described below:
function ParseWord(const Source,Table:String;var Index:Integer):String;
Sequential, left to right token parsing using a table of single character delimiters. Delimiters within quoted strings are ignored. Quote delimiters are not allowed in Table.
Index is a pointer (initialize to '1' for first word) updated by the function to point to next word. To retrieve the next word, simply call the function again using the prior returned Index value.
Note: If Length(Resultant) = 0, no additional words are available. Delimiters within quoted strings are ignored. (my emphasis)
This is what I have so far:
function ParseWord(const Source, Table: string; var Index: Integer): string;
var
  RE: TRegEx;
  Match: TMatch;
  Table2: string;
begin
  Result := '';
  // Nothing left to parse once Index has moved past the end of Source
  if Index > Length(Source) then
    Exit;
  // Escape the special characters and wrap them in a character class
  Table2 := '[' + TRegEx.Escape(Table, False) + ']';
  RE := TRegEx.Create(Table2);
  Match := RE.Match(Source, Index);
  if Match.Success then
  begin
    // The token runs from Index up to (not including) the delimiter
    Result := Copy(Source, Index, Match.Index - Index);
    Index := Match.Index + Match.Length;
  end
  else
  begin
    // No delimiter left: the rest of Source is the last token
    Result := Copy(Source, Index, Length(Source) - Index + 1);
    Index := Length(Source) + 1;
  end;
  // Skip the empty tokens produced by consecutive delimiters
  while (Length(Result) = 0) and (Index <= Length(Source)) do
    Result := ParseWord(Source, Table, Index);
end;
Cheers and thanks.
I would try this regex for Table2:
Table2 := '''[^'']+''|"[^"]+"|[^' + TRegEx.Escape(Table, false) + ']+';
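Note that this pattern matches the words themselves rather than the delimiters, so the code around it has to change as well: the match is the token, and Index moves to just after it. A minimal sketch of how that might look, not tested against XE2. It assumes that quoted tokens should be returned with their quotes, which the ParseWord description does not confirm, and that Table contains no `-`, which may need extra care inside a character class. The names `M` and `Token` are only illustrative.

```delphi
program ParseWordSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.RegularExpressions;

// Sketch only: returns the next token at or after Index, or '' when no
// token is left. Quoted tokens keep their quotes (an assumption).
function ParseWord(const Source, Table: string; var Index: Integer): string;
var
  RE: TRegEx;
  M: TMatch;
begin
  Result := '';
  if (Index < 1) or (Index > Length(Source)) then
    Exit;
  // A quoted string, or a run of characters that are not delimiters
  RE := TRegEx.Create(
    '''[^'']+''|"[^"]+"|[^' + TRegEx.Escape(Table, False) + ']+');
  M := RE.Match(Source, Index);
  if M.Success then
  begin
    // The match is the word itself; the next call resumes right after it
    Result := Copy(Source, M.Index, M.Length);
    Index := M.Index + M.Length;
  end
  else
    Index := Length(Source) + 1; // only delimiters were left
end;

var
  Sentence, Token: string;
  Index: Integer;
begin
  Sentence := 'toto titi "alloa toutou" ''dfg erre'' 1245|coucou "nestor|delphi"';
  Index := 1; // initialize to 1 for the first word
  Token := ParseWord(Sentence, ' |', Index);
  while Length(Token) > 0 do
  begin
    Writeln(Token);
    Token := ParseWord(Sentence, ' |', Index);
  end;
end.
```

Because the regex search skips any leading delimiters by itself, runs of consecutive delimiters never produce empty words here, so an empty result only means that no words are left.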
Demo:
This demo is more of a proof of concept, since I was unable to find an online Delphi regex tester.
The delimiters are the space (ASCII code 32) and pipe (ASCII code 124) characters.
The test sentence is:
toto titi "alloa toutou" 'dfg erre' 1245|coucou "nestor|delphi" "" ''
Discussion:
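I assume that a quoted string is a string enclosed by either two single quotes (') or two double quotes ("). Correct me if I am wrong.
The regex will match either a single-quoted string, a double-quoted string, or a run of one or more characters that are not in Table. Each quoted alternative requires at least one character between the quotes, so an empty pair such as "" is picked up by the third alternative instead.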
Known bug:
Since I didn't know how ParseWord handles quote escaping inside a string, the regex doesn't support this feature.
For instance:
'foo''bar'? => Two tokens: 'foo' and 'bar', OR one single token: 'foo''bar'.
"foo""bar"? => Two tokens: "foo" and "bar", OR one single token: "foo""bar".
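If the single-token reading turns out to be the right one, the quoted alternatives could be changed so that a doubled quote inside a quoted string counts as an escaped quote. The following is only a sketch under that assumption. `BuildTokenPattern` is a hypothetical helper, not part of HyperString or of the RTL, and the same caveat about a `-` in Table applies.

```delphi
program DoubledQuoteSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.RegularExpressions;

// Hypothetical helper: builds a token pattern in which a doubled quote
// inside a quoted string is treated as an escaped quote.
function BuildTokenPattern(const Table: string): string;
begin
  Result :=
    '''(?:[^'']|'''')*''' +                     // '...' where '' is an escaped quote
    '|"(?:[^"]|"")*"' +                         // "..." where "" is an escaped quote
    '|[^' + TRegEx.Escape(Table, False) + ']+'; // unquoted run of non-delimiters
end;

var
  RE: TRegEx;
  M: TMatch;
begin
  RE := TRegEx.Create(BuildTokenPattern(' |'));
  // The input is: 'foo''bar' "foo""bar" baz
  M := RE.Match('''foo''''bar'' "foo""bar" baz');
  while M.Success do
  begin
    Writeln(M.Value); // one token per line
    M := M.NextMatch;
  end;
end.
```

With this variant, 'foo''bar' and "foo""bar" are each matched as one token. Since the quoted alternatives now use `*` instead of `+`, an empty pair such as '' is also matched as a quoted string rather than falling through to the unquoted alternative.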