How to skip quoted text in regex (or how to use HyperStr ParseWord with Unicode text?)

Tags: regex, delphi

I need help with a regex to build a Delphi function that replaces the HyperString ParseWord function in RAD Studio XE2. HyperString was a very useful string library that never made the jump to Unicode. My version mostly works, but it does not honor quote delimiters at all. It needs to behave exactly like the function described below:

function ParseWord(const Source,Table:String;var Index:Integer):String;

Sequential, left to right token parsing using a table of single character delimiters. Delimiters within quoted strings are ignored. Quote delimiters are not allowed in Table.

Index is a pointer (initialize to '1' for first word) updated by the function to point to next word. To retrieve the next word, simply call the function again using the prior returned Index value.

Note: If Length(Resultant) = 0, no additional words are available. Delimiters within quoted strings are ignored. (my emphasis)

This is what I have so far:

function ParseWord( const Source, Table: String; var Index: Integer):string;
var
  RE : TRegEx;
  match : TMatch;
  Table2,
  chars : string;
begin
  if index = length(Source) then
  begin
    result:= '';
    exit;
  end;

  // escape the special characters and wrap them in a character class
  Table2 :='['+TRegEx.Escape(Table, false)+']';
  RE := TRegEx.create(Table2);
  match := RE.Match(Source,Index);
  if match.success then
  begin
    result := copy( Source, Index, match.Index - Index);
    Index := match.Index+match.Length;
  end
  else
  begin
    result := copy(Source, Index, length(Source)-Index+1);
    Index := length(Source);
  end;

  while ( Length(result)= 0) and (Index<length(Source)) do
  begin
    Inc(Index);
    result := ParseWord(Source,Table, Index);
  end;
end;

cheers and thanks.

marcp asked Nov 13 '22 20:11


1 Answer

I would try this regex for Table2:

Table2 := '''[^'']+''|"[^"]+"|[^' + TRegEx.Escape(Table, false) + ']+';

Demo:
This demo is more of a proof of concept, since I was unable to find an online Delphi regex tester.

  • The delimiters are the space (ASCII code 32) and pipe (ASCII code 124) characters.
  • The test sentence is:

    toto titi "alloa toutou" 'dfg erre' 1245|coucou "nestor|delphi" "" ''

http://regexr.com?32i81

Discussion:
I assume that a quoted string is a string enclosed in either a pair of single quotes (') or a pair of double quotes ("). Correct me if I am wrong.

The regex will match either:

  • a single quoted string
  • a double quoted string
  • a run of characters that contains none of the delimiters passed in Table
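
Note that this pattern matches the tokens themselves, whereas the code in the question searches for the next delimiter, so the function body has to change accordingly. Below is a minimal sketch of how that might look. ParseWordSketch is a hypothetical name, and the sketch makes three assumptions I could not check against HyperString: quoted tokens are returned with their quotes still attached, empty tokens between consecutive delimiters are skipped, and TRegEx.Escape produces text that is safe inside a character class for the delimiters you pass.

```delphi
program ParseWordSketchDemo;

{$APPTYPE CONSOLE}

uses
  System.RegularExpressions;

// Hypothetical stand-in for ParseWord: returns the next token at or after
// Index and moves Index just past it; returns '' when nothing is left.
// Assumption: quoted tokens keep their surrounding quotes.
function ParseWordSketch(const Source, Table: string; var Index: Integer): string;
var
  RE: TRegEx;
  M: TMatch;
begin
  Result := '';
  if (Index < 1) or (Index > Length(Source)) then
    Exit;

  // single-quoted run | double-quoted run | run of non-delimiter characters
  RE := TRegEx.Create('''[^'']+''|"[^"]+"|[^' + TRegEx.Escape(Table, False) + ']+');
  M := RE.Match(Source, Index);
  if M.Success then
  begin
    Result := M.Value;
    // M.Index is treated as a 1-based offset into Source,
    // the same way the code in the question uses it
    Index := M.Index + M.Length;
  end
  else
    Index := Length(Source) + 1; // only delimiters were left
end;

var
  Idx: Integer;
  Token: string;
begin
  Idx := 1; // start at the first word, as the ParseWord description says
  repeat
    Token := ParseWordSketch('toto "alloa toutou" 1245|coucou "nestor|delphi"', ' |', Idx);
    if Token <> '' then
      Writeln(Token);
  until Token = '';
end.
```

The calling loop follows the convention from the question: start with Index = 1, call again with the updated Index, and stop at the first empty result. If ParseWord is supposed to strip the quotes, that would need an extra step on the returned value.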

Known bug:
Since I don't know how ParseWord handles quote escaping inside a string, the regex doesn't support it.

For instance:

  • How should 'foo''bar' be interpreted? As two tokens, 'foo' and 'bar', or as one single token 'foo''bar'?
  • Likewise for "foo""bar": two tokens, "foo" and "bar", or one single token "foo""bar"?
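
If the single-token reading turns out to be the right one, the quoted alternatives could be loosened so that a doubled quote inside a quoted string counts as an escaped quote. This is only a sketch of that variant: BuildTokenPattern is a hypothetical helper, and I have not confirmed that HyperString's ParseWord behaves this way.

```delphi
program QuoteEscapeSketch;

{$APPTYPE CONSOLE}

uses
  System.RegularExpressions;

const
  // '...' where a doubled '' inside the string is an escaped quote
  SingleQuoted = '''(?:[^'']|'''')*''';
  // "..." where a doubled "" inside the string is an escaped quote
  DoubleQuoted = '"(?:[^"]|"")*"';

// Hypothetical helper: builds the token pattern for the given delimiters.
// Assumption: TRegEx.Escape output is safe inside a character class.
function BuildTokenPattern(const Table: string): string;
begin
  Result := SingleQuoted + '|' + DoubleQuoted + '|' +
            '[^' + TRegEx.Escape(Table, False) + ']+';
end;

var
  RE: TRegEx;
  M: TMatch;
begin
  RE := TRegEx.Create(BuildTokenPattern(' |'));
  // Input text is: 'foo''bar' "foo""bar" baz
  M := RE.Match('''foo''''bar'' "foo""bar" baz');
  while M.Success do
  begin
    Writeln(M.Value); // one token per line
    M := M.NextMatch;
  end;
end.
```

With this variant the two ambiguous inputs above should each be matched as one token. Unlike the original pattern, it also accepts empty quoted strings ('' and "") through the quoted alternatives rather than the generic one.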

Stephan answered Jan 04 '23 01:01