Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Delphi: Why Doesn't Binary String Comparison Operator (=) use SameStr?

It is common knowledge that SameStr(S1, S2) is faster than S1 = S2, where var S1, S2: string in Delphi.

(And, of course, SameText(S1, S2) is much faster than AnsiLowerCase(S1) = AnsiLowerCase(S2).)

But, as far as I understand it, SameStr(S1, S2) does exactly the same thing as S1 = S2, so I cannot help but wonder why in the world the Delphi compiler doesn't use the SameStr code when it test for string equality using the = operator. Surely there must be a reason for this?

Some Benchmarking

A trivial program,

program Project1;

{$APPTYPE CONSOLE}

uses
  SysUtils,
  RejbrandCommon;

const
  N = 1000000;

var
  Strings1, Strings2: StringArray;
  i: integer;
  b: {dummy }boolean;

procedure CreateRandomStringArrays;
var
  i: integer;
begin
  SetLength(Strings1, N);
  SetLength(Strings2, N);
  for i := 0 to N - 1 do
  begin
    Strings1[i] := RandomString(0, 40);
    Strings2[i] := RandomString(0, 40);
  end;
end;

begin

  CreateRandomStringArrays;

  StartClock;
  for i := 0 to N - 1 do
    if Strings1[i] = Strings2[i] then
      b := not b;
  StopClock;
  OutputClock;

  StartClock;
  for i := 0 to N - 1 do
    if SameStr(Strings1[i], Strings2[i]) then
      b := not b;
  StopClock;
  OutputClock;

  Pause;

end.

where

function RandomString(const LowerLimit: integer = 2; const UpperLimit: integer = 20): string;
var
  N, i: integer;
begin
  N := RandomRange(LowerLimit, UpperLimit);
  SetLength(result, N);
  for i := 1 to N do
    result[i] := RandomChar;
end;

and the inlined

function RandomChar: char;
begin
  result := chr(RandomRange(ord('A'), ord('Z')));
end;

and the "clock" functions just wrap QueryPerformanceCounter, QueryPerformanceFrequency, and Writeln, produces the output

2.56599325762716E-0002
1.24310093156453E-0002
ratio ~ 2.06

If the difference in length of the two strings that we compare is large, then the difference is even bigger. We try

Strings1[i] := RandomString(0, 0); // = '';
Strings2[i] := RandomString(0, 40);

and obtain

1.81630411160156E-0002
4.44662043198641E-0003
ratio ~ 4.08

So why doesn't the compiler use the SameStr code when writing assembly for S1 = S2?

Update

After reading Cosmin Prund's excellent answer, I couldn't resist setting

Strings1[i] := RandomString(40, 40);
Strings2[i] := RandomString(40, 40);

to produce strings of equal length and indeed.

2.74783364614126E-0002
1.96818773095322E-0002
ratio ~ 1.40

Hm... SameStr still wins...

My Specs

CPU Brand String: Intel(R) Core(TM) i7 CPU         870  @ 2.93GHz
Memory: 6 GB
OS: Windows 7 Home Premium (64-bit)
Compiler/RTL: Delphi 2009

Update

It would seem (see the comments below Cosmin Prund's excellent answer) like the = operator was changed between D2009 and D2010. Can anyone confirm this?

like image 597
Andreas Rejbrand Avatar asked Jul 12 '10 16:07

Andreas Rejbrand


2 Answers

Answer

It all depends on how you're building the random strings. I used a modified version of the code, because very few of us have the RejbrandCommon unit, and because I wanted to use Excel to finish my analyses (and make pretty pictures).

Code (skip over the code to see some conclusions):

program Project3;

{$APPTYPE CONSOLE}

uses
  SysUtils, Windows;

const
  StringsNumber = 2000000;

var
  Strings1, Strings2: array of string;
  StrLen: integer;
  b: {dummy }boolean;

function RandomString(MinLen, MaxLen:Integer):string;
var N, i:Integer;
begin
  N := MinLen + Random(MaxLen-MinLen);
  Assert(N >= MinLen); Assert(N <= MaxLen);
  SetLength(Result, N);
  for i:=1 to N do
    Result[i] := Char(32 + Random(1024)); // Random Unicode Char
end;

procedure CreateRandomStringArrays(StrLen:Integer);
var
  i: integer;
  StrLen2:Integer;
begin
  SetLength(Strings1, StringsNumber);
  SetLength(Strings2, StringsNumber);
  for i := 0 to StringsNumber - 1 do
  begin
    StrLen2 := StrLen + Random(StrLen div 2);
    Strings1[i] := RandomString(StrLen, StrLen2);
    StrLen2 := StrLen + Random(StrLen div 2);
    Strings2[i] := RandomString(StrLen, StrLen2);
  end;
end;

var C1, C2, C3, C4:Int64;

procedure RunTest(StrLen:Integer);
var i:Integer;
begin
  CreateRandomStringArrays(StrLen);

  // Test 1: using equality operator
  QueryPerformanceCounter(C1);
  for i := 0 to StringsNumber - 1 do
    if Strings1[i] = Strings2[i] then
      b := not b;
  QueryPerformanceCounter(C2);

  // Test 2: using SameStr
  QueryPerformanceCounter(C3);
  for i := 0 to StringsNumber - 1 do
    if SameStr(Strings1[i], Strings2[i]) then
      b := not b;
  QueryPerformanceCounter(C4);

  // Results:
  C2 := C2 - C1;
  C4 := C4 - C3;
  WriteLn(IntToStr(StrLen) + #9 + IntToStr(C2) + #9 + IntToStr(C4));
end;

begin

  WriteLn('Count'#9'='#9'SameStr');
  for StrLen := 1 to 50 do
    RunTest(StrLen);

end.

I made the CreateRandomStringArrays routine take an StrLen parameter so I can run multiple similar tests in a loop. I made the code use QueryPerformanceCounter directly and WriteLn the results in tab-delimited format so I can copy/paste it into Excel. In Excel I get results in this form:

StrLen  =   SameStr
1   61527   69364
2   60188   69450
3   72130   68891
4   78847   85779
5   77852   78286
6   83612   88670
7   93936   96773

I then normalized things a bit. On each line made the maximum value "1" and the other value an percentage of 1. The result looks like this:

StrLen  =   SameStr
1   0,88    1
2   0,86    1
3   1   0,95
4   0,91    1
5   0,99    1
6   0,94    1
7   0,97    1

And then I started playing with the CreateRandomStringArrays routine to run multiple tests.

This is how the plot looks like for the original case (CreateRandomStringArrays generates strings of random length, of length 1 to whatever's on the X axis). Blue is the result for the "=" operator, red is the result for the "SameStr", lower is better. It clear shows SameStr() has an edge for strings longer then 10 chars.

alt text http://fisiere.sediu.ro/PentruForumuri/V1_1_to_maxlen.png

Next test, made CreateRandomStringArrays return strings of EQUAL length. The content of the strings is still fully-random, but the length of the strings is equal to whatever is on the X axis. This time the "=" operator is clearly more efficient:

alt text http://fisiere.sediu.ro/PentruForumuri/V1_equal_strings.png

Now the real question is, with REAL code, what's the probability of strings being equal? And how large has to be the difference for SameStr() to start gaining terrain? Next text, I'm building two strings, the first one's of StrLen (the number on the X axis), the second string has a length of StrLen + Random(4). Again, the "=" operator is better:

alt text http://fisiere.sediu.ro/PentruForumuri/V1_rnd_p4.png

Next test, I have two strings, each of length: StrLen + Random(StrLen div 10). The "=" operator is better.

alt text http://fisiere.sediu.ro/PentruForumuri/V1_rnd_pm_10p.png

... and my final test, strings of +/- 50% length. Formula: StrLen + Random(StrLen div 2). The SameStr() wins this round:

alt text http://fisiere.sediu.ro/PentruForumuri/V1_rnd_pm_50p.png

Conclusion

I'm not sure. I didn't expect this to be linked to string length! I'd expect both functions to handle strings of different lengths lightning fast, but it doesn't happen.

like image 135
Cosmin Prund Avatar answered Sep 28 '22 06:09

Cosmin Prund


SameStr has an optional third parameter: LocaleOptions. You get the behaviour similar to "=" by leaving out the third parameter: a case senstive locale independent comparison.

You would think that is the same as a binary comparison, but it isn't.

Since D2009 Delphi strings have an "code page" payload in addition to the length and the refcount.

  StrRec = packed record
    codePage: Word;
    elemSize: Word;
    refCnt: Longint;
    length: Longint;
  end;

When you do a String1 = String2 you are telling the compiler to ignore all the information about the string and simply do a binary comparison (it uses UStrEqual for that).

When you do a SameStr or CompareStr (which is used by SameStr) Delphi will first check the string for being Unicode (UTF-16LE) and if not, convert them before doing the actual work.

You can see this when you look at the implementation of CompareStr (the one without the third parameter) which, after initial optimisations, checks whether the arguments are unicode strings and if not, converts them using UStrFromLStr.

Update:

Actually, UStrEqual (by means of UStrCmp) also does conversions, like CompareStr it looks at the elemSize of the strings to decide whether they are Unicode or not and converts them if they aren't.

So the reason why the compiler does not use SameStr (CompareStr) for the = operator eludes me at the moment. The only thing I can think of is that it has a nice analogy to the LStrEqual used to '='-compare AnsiStrings. I guess only the compiler people know.

Sorry to have wasted your time. I am leaving the answer though, so others won't have to go down this route of investigation.

like image 43
Marjan Venema Avatar answered Sep 28 '22 05:09

Marjan Venema