
Is There A Fast GetToken Routine For Delphi?


In my program, I process millions of strings that have a special character, e.g. "|" to separate tokens within each string. I have a function to return the n'th token, and this is it:

    function GetTok(const Line: string; const Delim: string; const TokenNum: Byte): string;
    { LK Feb 12, 2007 - This function has been optimized as best as possible }
    var
      I, P, P2: integer;
    begin
      P2 := Pos(Delim, Line);
      if TokenNum = 1 then begin
        if P2 = 0 then
          Result := Line
        else
          Result := copy(Line, 1, P2-1);
      end
      else begin
        P := 0; { To prevent warnings }
        for I := 2 to TokenNum do begin
          P := P2;
          if P = 0 then break;
          P2 := PosEx(Delim, Line, P+1);
        end;
        if P = 0 then
          Result := ''
        else if P2 = 0 then
          Result := copy(Line, P+1, MaxInt)
        else
          Result := copy(Line, P+1, P2-P-1);
      end;
    end; { GetTok }

I developed this function back when I was using Delphi 4. It calls the very efficient PosEx routine that was originally developed by Fastcode and is now included in the StrUtils library of Delphi.

I recently upgraded to Delphi 2009 and my strings are all Unicode. This GetTok function still works and still works well.

I have gone through the new libraries in Delphi 2009 and there are many new functions and additions to it.

But I have not seen a GetToken function like the one I need in any of the new Delphi libraries or in the various Fastcode projects, and a Google search turns up nothing other than Zarko Gajic's Delphi Split / Tokenizer Functions, which is not as optimized as what I already have.

Any improvement, even 10%, would be noticeable in my program. I know an alternative is to use StringLists and always keep the tokens separate, but this has a big memory overhead, and I'm not sure that, after all the work of converting, it would be any faster.

Whew. So after all this long winded talk, my question really is:

Do you know of any very fast implementations of a GetToken routine? An assembler-optimized version would be ideal.

If not, are there any optimizations that you can see to my code above that might make an improvement?


Followup: Barry Kelly mentioned a question I asked a year ago about optimizing the parsing of the lines in a file. At that time I hadn't even thought of my GetTok routine, which was not used for that reading or parsing. It is only now that I saw the overhead of my GetTok routine, which led me to ask this question. Until Carl Smotricz's and Barry's answers, I had never thought of connecting the two. So obvious, but it just didn't register. Thanks for pointing that out.

Yes, my Delim is a single character, so obviously I have some major optimization I can do. My use of Pos and PosEx in the GetTok routine (above) blinded me to the idea that I can do it faster with a character by character search instead, with bits of code like:

    while (cp^ > #0) and (cp^ <= Delim) do
      Inc(cp);

I'm going to go through everyone's answers and try the various suggestions and compare them. Then I'll post the results.


Confusion: Okay, now I'm really perplexed.

I took Carl and Barry's recommendation to go with PChars, and here is my implementation:

    function GetTok(const Line: string; const Delim: string; const TokenNum: Byte): string;
    { LK Feb 12, 2007 - This function has been optimized as best as possible }
    { LK Nov 7, 2009 - Reoptimized using PChars instead of calls to Pos and PosEx }
    { See; https://stackoverflow.com/questions/1694001/is-there-a-fast-gettoken-routine-for-delphi }
    var
      I: integer;
      PLine, PStart: PChar;
    begin
      PLine := PChar(Line);
      PStart := PLine;   { start of the current token }
      for I := 1 to TokenNum do begin
        { scan to the next delimiter or the end of the line }
        while (PLine^ <> #0) and (PLine^ <> Delim) do
          inc(PLine);
        if I = TokenNum then begin
          SetString(Result, PStart, PLine - PStart);
          break;
        end;
        if PLine^ = #0 then begin
          Result := '';
          break;
        end;
        inc(PLine);      { step over the delimiter }
        PStart := PLine;
      end;
    end; { GetTok }

On paper, I don't think you can do much better than this.

So I put both routines to the task and used AQTime to see what's happening. The run I had included 1,108,514 calls to GetTok.

AQTime timed the original routine at 0.40 seconds. The million calls to Pos took 0.10 seconds. The half million TokenNum = 1 copies took 0.10 seconds. The 600,000 PosEx calls took only 0.03 seconds.

Then I timed my new routine with AQTime for the same run and exactly the same calls. AQTime reports that my new "fast" routine took 3.65 seconds, which is 9 times as long. The culprit according to AQTime was the first loop:

    while (PLine^ <> #0) and (PLine^ <> Delim) do
      inc(PLine);

The while line, which was executed 18 million times, was reported at 2.66 seconds. The inc line, executed 16 million times, was said to take 0.47 seconds.

Now I thought I knew what was happening here. I had a similar problem with AQTime in a question I posed last year: Why is CharInSet faster than Case statement?

Again it was Barry Kelly who clued me in. Basically, an instrumenting profiler like AQTime is not necessarily the right tool for micro-optimization. It adds overhead to every line executed, which can swamp the results, as these numbers clearly show. The 34 million lines executed in my new "optimized" code overwhelm the several million lines of my original code, which incurs little or no overhead from the Pos and PosEx routines.

Barry gave me a sample of code using QueryPerformanceCounter to check that he was correct, and in that case he was.

Okay, so let's do the same now with QueryPerformanceCounter to prove that my new routine is faster and not 9 times slower as AQTime says it is. So here I go:

    function TimeIt(const Title: string): double;
    var
      i: Integer;
      start, finish, freq: Int64;
      Seconds: double;
    begin
      QueryPerformanceCounter(start);
      for i := 1 to 250000 do
        GetTokOld('This is a string|that needs|parsing', '|', 1);
      for i := 1 to 250000 do
        GetTokOld('This is a string|that needs|parsing', '|', 2);
      for i := 1 to 250000 do
        GetTokOld('This is a string|that needs|parsing', '|', 3);
      for i := 1 to 250000 do
        GetTokOld('This is a string|that needs|parsing', '|', 4);
      QueryPerformanceCounter(finish);
      QueryPerformanceFrequency(freq);
      Seconds := (finish - start) / freq;
      Result := Seconds;
    end;

So this will test 1,000,000 calls to GetTok.

My old procedure with the Pos and PosEx calls took 0.29 seconds. The new one with PChars took 2.07 seconds.

Now I am completely befuddled! Can anyone tell me why the PChar procedure is not only slower, but is 8 to 9 times slower!?


Mystery solved! Andreas said in his answer to change the Delim parameter from a string to a Char. I'll always be using just a Char, so at least for my implementation this is very possible. I was amazed at what happened.

The time for the 1 million calls went down from 1.88 seconds to .22 seconds.

And surprisingly, the time for my original Pos/PosEx routine went UP from .29 to .44 seconds when I changed its Delim parameter to a Char.

Frankly, I'm disappointed by Delphi's optimizer. That Delim is a constant parameter. The optimizer should have noticed that the same conversion is happening within the loop and should have moved it out so that it would only be done once.
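For anyone who has to keep a string-typed Delim, the hoisting described above can also be done by hand. The following is only a sketch and not code from this thread: GetTokHoisted and DelimCh are made-up names, and it assumes that Delim is non-empty and that only its first character matters. It has not been timed here, so any speed benefit is something to measure rather than assume.

```pascal
program HoistDelim;
{$APPTYPE CONSOLE}

// Hypothetical variant (not from the original post): keeps the string-typed
// Delim parameter but reads its first character once, before the loop.
// Assumes Delim is non-empty and that only Delim[1] matters.
function GetTokHoisted(const Line, Delim: string; TokenNum: Integer): string;
var
  CurToken: Integer;
  DelimCh: Char;
  P, PStart: PChar;
begin
  Result := '';
  if (Delim = '') or (TokenNum < 1) then
    Exit;
  DelimCh := Delim[1];       // single string access, done outside the loop
  P := PChar(Line);
  PStart := P;               // start of the current token
  CurToken := 1;
  while P^ <> #0 do
  begin
    if P^ = DelimCh then     // Char compared with Char
    begin
      if CurToken = TokenNum then
        Break;               // P now points just past the wanted token
      Inc(CurToken);
      PStart := P + 1;       // next token starts after the delimiter
    end;
    Inc(P);
  end;
  if CurToken = TokenNum then
    SetString(Result, PStart, P - PStart);
end;

begin
  // prints: that needs
  Writeln(GetTokHoisted('This is a string|that needs|parsing', '|', 2));
  // token 4 does not exist, so the result is the empty string; prints: TRUE
  Writeln(GetTokHoisted('This is a string|that needs|parsing', '|', 4) = '');
end.
```

Like the PChar versions in this post, the sketch stops at the first #0 character, so it is only suitable for lines without embedded nulls.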

Double checking my Code generation parameters, yes I do have Optimization True and String format checking Off.

Bottom line is that the new PChar routine with Andreas's fix is about 25% faster than my original (.22 versus .29).

I still want to follow up on the other comments here and test them out.


Turning off optimization and turning on String format checking only increases the time from .22 to .30. It adds about the same to the original.

The advantage to using assembler code, or calling routines written in assembler like Pos or PosEx is that they are NOT subject to what code generation options you have set. They will always run the same way, a pre-optimized and non-bloated way.

I have reaffirmed in the last couple of days, that the best way to compare code for microoptimization is to look at and compare the Assembler code in the CPU window. It would be nice if Embarcadero could make that window a bit more convenient, and allow us to copy portions to the clipboard or to print sections of it.

Also, I unfairly slammed AQTime earlier in this post, thinking that the extra time added for my new routine was solely because of the instrumentation it added. Now that I go back and check with the Char parameter instead of String, the while loop is down to .30 seconds (from 2.66) and the inc line is down to .14 seconds (from .47). Strange that the inc line would go down as well. But I'm getting worn out from all this testing already.


I took Carl's idea of looping by characters, and rewrote that code with that idea. It makes another improvement, down to .19 seconds from .22. So here is now the best so far:

    function GetTok(const Line: string; const Delim: Char; const TokenNum: Byte): string;
    { LK Nov 8, 2009 - Reoptimized using PChars instead of calls to Pos and PosEx }
    { See; https://stackoverflow.com/questions/1694001/is-there-a-fast-gettoken-routine-for-delphi }
    var
      I, CurToken: Integer;
      PLine, PStart: PChar;
    begin
      CurToken := 1;
      PLine := PChar(Line);
      PStart := PLine;
      for I := 1 to length(Line) do begin
        if PLine^ = Delim then begin
          if CurToken = TokenNum then
            break
          else begin
            CurToken := CurToken + 1;
            inc(PLine);
            PStart := PLine;
          end;
        end
        else
          inc(PLine);
      end;
      if CurToken = TokenNum then
        SetString(Result, PStart, PLine - PStart)
      else
        Result := '';
    end;

There may still be some minor optimizations to this, such as in the CurToken = TokenNum comparison, where both operands should be the same type, Integer or Byte, whichever is faster.

But let's say, I'm happy now.

Thanks again to the StackOverflow Delphi community.

lkessler asked Nov 07 '09 18:11


1 Answer

It makes a big difference what "Delim" is expected to be. If it's expected to be a single character, you're far better off stepping through the string character by character, ideally through a PChar, and testing specifically.

If it's a long string, Boyer-Moore and similar searches have a set-up phase for skip tables, and the best way would be to build the tables once, and reuse them for each subsequent find. That means you need state between calls, and this function would be better off as a method on an object instead.

You might be interested in this answer I gave to a question some time before, about the fastest way to parse a line in Delphi. (But I see that it is you that asked the question! Nevertheless, in solving your problem, I would hew to how I described parsing, not using PosEx like you are using, depending on what Delim normally looks like.)


UPDATE: OK, I spent about 40 minutes looking at this. If you know the delimiter is going to be a character, you're pretty much always better off with the second version (i.e. PChar scanning), but you have to pass Delim as a character. At the time of writing, you're converting the PLine^ expression - of type Char - to a string for comparison with Delim. That will be very slow; even indexing into the string, with Delim[1] will also be somewhat slow.

However, depending on how large your lines are, and how many delimited pieces you want to pull out, you may be better off with a resumable approach, rather than skipping unwanted delimited pieces inside the tokenizing routine. If you call GetTok with successively increasing indexes, like you are currently doing in your mini benchmark, you'll end up with O(n*n) performance, where n is the number of delimited sections. That can be turned into O(n) if you save the state of the scan and restore it for the next iteration, or pack all extracted items into an array.
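To put rough numbers on the quadratic behaviour described above, here is a small counting sketch. It is an illustration only: CharsExamined is a hypothetical helper, not one of the benchmarked routines, and it merely counts how many characters a scan that restarts at the beginning of the line has to look at before it reaches the end of the requested token.

```pascal
program ScanCost;
{$APPTYPE CONSOLE}

// Hypothetical helper: counts the characters examined by a scan that
// restarts at the beginning of Line to find token number TokenNum.
function CharsExamined(const Line: string; Delim: Char; TokenNum: Integer): Integer;
var
  I, CurToken: Integer;
begin
  Result := 0;
  CurToken := 1;
  for I := 1 to Length(Line) do
  begin
    Inc(Result);
    if Line[I] = Delim then
    begin
      if CurToken = TokenNum then
        Exit;                // reached the end of the requested token
      Inc(CurToken);
    end;
  end;
end;

const
  Line = 'aa|bb|cc|dd';
var
  K, Total: Integer;
begin
  Total := 0;
  for K := 1 to 4 do
    Inc(Total, CharsExamined(Line, '|', K));
  Writeln('restart every call: ', Total);        // 3 + 6 + 9 + 11 = 29
  Writeln('single pass:        ', Length(Line)); // 11
end.
```

With four short tokens the gap is small, but the restart total grows roughly with the square of the token count, while a resumed or single-pass scan only ever looks at each character once.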

Here's a version that does all tokenization once, and returns an array. It needs to tokenize twice though, in order to know how large to make the array. On the other hand, only the second tokenization needs to extract the strings:

    // Do all tokenization up front.
    function GetTok4(const Line: string; const Delim: Char): TArray<string>;
    var
      cp, start: PChar;
      count: Integer;
    begin
      // Count sections: one more than the number of delimiters
      count := 1;
      cp := PChar(Line);
      start := cp;
      while True do
      begin
        if cp^ <> #0 then
        begin
          if cp^ <> Delim then
            Inc(cp)
          else
          begin
            Inc(cp);
            Inc(count);
          end;
        end
        else
          Break;
      end;

      SetLength(Result, count);
      cp := start;
      count := 0;

      // Second pass: extract each section into the array
      while True do
      begin
        if cp^ <> #0 then
        begin
          if cp^ <> Delim then
            Inc(cp)
          else
          begin
            SetString(Result[count], start, cp - start);
            Inc(cp);
            start := cp; // next section begins after the delimiter
            Inc(count);
          end;
        end
        else
        begin
          SetString(Result[count], start, cp - start);
          Break;
        end;
      end;
    end;

Here's the resumable approach. The loads and stores of the current position and delimiter character do have a cost, though:

    type
      TTokenizer = record
      private
        FSource: string;
        FCurrPos: PChar;
        FDelim: Char;
      public
        procedure Reset(const ASource: string; ADelim: Char); inline;
        function GetToken(out AResult: string): Boolean; inline;
      end;

    procedure TTokenizer.Reset(const ASource: string; ADelim: Char);
    begin
      FSource := ASource; // keep reference alive
      FCurrPos := PChar(FSource);
      FDelim := ADelim;
    end;

    function TTokenizer.GetToken(out AResult: string): Boolean;
    var
      cp, start: PChar;
      delim: Char;
    begin
      // copy members to locals for better optimization
      cp := FCurrPos;
      delim := FDelim;

      if cp^ = #0 then
      begin
        AResult := '';
        Exit(False);
      end;

      start := cp;
      while (cp^ <> #0) and (cp^ <> Delim) do
        Inc(cp);

      SetString(AResult, start, cp - start);
      if cp^ = Delim then
        Inc(cp);
      FCurrPos := cp;
      Result := True;
    end;

Here's the full program I used for benchmarking.

Here are the results:

    *** count=3, Length(src)=200
    GetTok1: 595 ms
    GetTok2: 547 ms
    GetTok3: 2366 ms
    GetTok4: 407 ms
    GetTokBK: 226 ms
    *** count=6, Length(src)=350
    GetTok1: 1587 ms
    GetTok2: 1502 ms
    GetTok3: 6890 ms
    GetTok4: 679 ms
    GetTokBK: 334 ms
    *** count=9, Length(src)=500
    GetTok1: 3055 ms
    GetTok2: 2912 ms
    GetTok3: 13766 ms
    GetTok4: 947 ms
    GetTokBK: 446 ms
    *** count=12, Length(src)=650
    GetTok1: 4997 ms
    GetTok2: 4803 ms
    GetTok3: 23021 ms
    GetTok4: 1213 ms
    GetTokBK: 543 ms
    *** count=15, Length(src)=800
    GetTok1: 7417 ms
    GetTok2: 7173 ms
    GetTok3: 34644 ms
    GetTok4: 1480 ms
    GetTokBK: 653 ms

Depending on the characteristics of your data, whether the delimiter is likely to be a character or not, and how you work with it, different approaches may be faster.

(I made a mistake in my earlier program, I wasn't measuring the same operations for each style of routine. I updated the pastebin link and benchmark results.)

Barry Kelly answered Oct 11 '22 13:10