
Safe concatenation of large string in Delphi

I perform operations on quite large strings: I search for occurrences of given phrases and do various jobs on what I'll call a "database" (I prepare a file with data for further processing in R), using essentially two functions: Pos and StringReplace. Most of the strings are about 20-30 MB in size; some are larger.

From the documentation here, I know that all strings declared as "String", for example:

my_string : String;

are Unicode strings ("Note: In RAD Studio, string is an alias for UnicodeString"). This means I shouldn't have to worry about their size or memory allocation, because RAD Studio handles that automatically. Of course, at this stage I could ask: does the choice of declaration matter to the compiler and affect the behavior of the strings, given that they are technically the same?

my_string1 : String;
my_string2 : AnsiString;
my_string3 : UnicodeString;

Does it have any significance for size, allocation, length, etc. (we are talking about strings over 20 MB)?

And now the most important question - how to safely combine two large strings with each other? Safe for memory leaks and string content, secure for program speed, etc. Here are two options:

var string1, string2: String;
...
string1 := string1 + string2;

where the documentation here and here indicates that this is the way to concatenate strings in Delphi. But there is another way: I can set a very large string size in advance and copy the content of the second string into it with the Move procedure.

const string_size: Integer = 1024*1024;
var string1, string2: String;
    concat_place: Integer = 1;
...
SetLength(string1, string_size);
Move(string2[1],string1[concat_place],Length(string2));
Inc(concat_place,Length(string2));

which seems to be much safer, because the size of this string in memory does not change dynamically; I just move the appropriate values into it. Is this a better idea? Or are there even better ones? Maybe I don't understand something?

And a bonus question: I tested searches on both String and AnsiString using Pos and AnsiPos. They seem to work the same in all combinations. Does this mean that they are now identical in Delphi?

Thank you in advance for all tips.

kwadratens asked Dec 08 '22 11:12


1 Answer

In Delphi, strings have always been managed by the compiler.

In practice, this means that the programmer need not worry at all about their memory allocation or lifetime, and there will be no (unexpected) memory leaks. Strings are as easy and safe to use as plain integers (unless you start doing very strange things).

Under the hood, a string variable is a pointer to a string data structure, and strings are reference-counted and use copy-on-write semantics. Although you very likely do not need the details, they are documented.
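The reference counting and copy-on-write behaviour can be observed directly. The following is only a minimal sketch, assuming Delphi 2009 or later (where StringRefCount is available in the System unit); the program and variable names are made up for illustration:

```pascal
program CowDemo;

{$APPTYPE CONSOLE}

var
  S1, S2: string;
  SharedBefore, SharedAfter: Boolean;
begin
  // Build the string at run time; string literals are constants with a
  // reference count of -1 and would not show the counting.
  S1 := StringOfChar('a', 10);
  S2 := S1;                                   // no character data is copied here
  SharedBefore := Pointer(S1) = Pointer(S2);  // both variables point to one buffer
  Writeln('Same buffer after assignment: ', SharedBefore);
  Writeln('Reference count: ', StringRefCount(S1));

  S2[1] := 'b';                               // write access makes S2 unique first
  SharedAfter := Pointer(S1) = Pointer(S2);
  Writeln('Same buffer after write: ', SharedAfter);
  Writeln(S1, ' / ', S2);                     // S1 is unaffected by the change to S2
end.
```

Assigning one string variable to another is therefore cheap even for a 20 MB string; the actual copy only happens if one of them is later modified.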

Before Delphi 2009, strings were not Unicode: they used one byte per character, and thus only 255 non-null characters were available, determined by the current code page. These were hard times.

In Delphi 2009 and later, strings are Unicode strings, with two bytes per character (the encoding is UTF-16, so characters outside the Basic Multilingual Plane take two such elements). Hence, it is now possible to encode strings like "∑γ + ∫sin²x dx" without effort, and you never need to worry about code pages.

You implied that you believe that the following declarations are the same:

MyString1: string;
MyString2: AnsiString;
MyString3: UnicodeString;

Well, in Delphi 2009 and later, UnicodeString and string are the same: they are Unicode strings with two bytes per character. However, AnsiString is the old (legacy, pre-2009) string type, which uses a single byte per character (max 255 non-null characters) and depends on a code page. Try storing "∑γ + ∫sin²x dx" in an AnsiString!
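To see the difference in element size, a small console program along these lines can help. It is a sketch assuming Delphi 2009 or later; the names U and A are only illustrative:

```pascal
program StringSizes;

{$APPTYPE CONSOLE}

var
  U: string;       // alias for UnicodeString in Delphi 2009 and later
  A: AnsiString;   // legacy type, one byte per element
begin
  U := 'abc';
  A := 'abc';
  Writeln('SizeOf(Char)     = ', SizeOf(Char));
  Writeln('SizeOf(AnsiChar) = ', SizeOf(AnsiChar));
  // Length counts elements, not bytes; multiply by the element size for bytes.
  Writeln('Length(U) = ', Length(U), ', bytes = ', Length(U) * SizeOf(Char));
  Writeln('Length(A) = ', Length(A), ', bytes = ', Length(A) * SizeOf(AnsiChar));
end.
```

For the 20-30 MB data mentioned in the question, this means the same text needs roughly twice as much memory in a string as in an AnsiString.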

And now the most important question - how to safely combine two large strings with each other? Safe for memory leaks and string content, secure for program speed, etc.

To combine two strings in Delphi, you almost always use the + operator: MyString1 + MyString2. This is 100% safe in terms of correctness, memory management, etc. There will not be any memory leaks. Concatenating strings in Delphi is this simple.

However, in terms of speed, there are situations where you might be able to improve on this. The + operator will cause the compiler to create code for making a new internal string data structure and copying the contents of MyString1 and MyString2 to that new area.

So, for instance, if you want to build a large string by concatenating many smaller strings (or even single characters), you might gain (a lot of) performance by not using successive + operations, but instead at the start allocating a sufficiently large result string (using SetLength and a character count) and copying the characters/strings to it manually (e.g., using Move and a byte count).

Notice that I emphasise the word byte: your example,

Move(string2[1], string1[concat_place], Length(string2));

likely doesn't do what you expect. Since the strings are declared as string, in Delphi 2009 and later, they are Unicode strings and so there are two bytes per character. So you need to copy 2*Length(string2) bytes. To be safe, I'd write

Move(string2[1], string1[concat_place], sizeof(char) * Length(string2));

This code will work both in pre-2009 and post-2009 Delphi versions, assuming the strings are declared as string. Before Delphi 2009, sizeof(char) is 1; in Delphi 2009 and later, sizeof(char) is 2.
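Put together, the preallocate-and-copy approach might look roughly like this. It is a sketch, not a definitive implementation: AppendAt is a hypothetical helper name, the buffer size of 1024 characters is arbitrary, and the caller is assumed to know an upper bound for the total length in advance:

```pascal
program PreallocConcat;

{$APPTYPE CONSOLE}

// Hypothetical helper: copies S into Buffer starting at the 1-based character
// index WritePos and advances WritePos. Buffer must already be long enough.
procedure AppendAt(var Buffer: string; var WritePos: Integer; const S: string);
begin
  if S = '' then
    Exit;                                   // S[1] is not valid for an empty string
  Assert(WritePos + Length(S) - 1 <= Length(Buffer), 'Buffer too small');
  // Move takes a byte count, so scale the character count by SizeOf(Char).
  Move(S[1], Buffer[WritePos], Length(S) * SizeOf(Char));
  Inc(WritePos, Length(S));
end;

var
  Buffer: string;
  WritePos: Integer;
begin
  SetLength(Buffer, 1024);                  // capacity in characters, not bytes
  WritePos := 1;
  AppendAt(Buffer, WritePos, 'Hello, ');
  AppendAt(Buffer, WritePos, 'world');
  SetLength(Buffer, WritePos - 1);          // trim the unused tail
  Writeln(Buffer);
end.
```

The final SetLength matters: without it, the string keeps its full preallocated length and the unused tail contains whatever happened to be in memory.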

As a simple benchmark, I tried

function GetChar: char;
begin
  Result := Char(1 + Random(1000));
end;

const
  N = 100000000;

function MakeString1: string;
var
  i: Integer;
begin
  Result := '';
  for i := 1 to N do
    Result := Result + GetChar;
end;

function MakeString2: string;
var
  i: Integer;
begin
  SetLength(Result, N);
  for i := 1 to N do
    Result[i] := GetChar;
end;

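// Needs Winapi.Windows (QueryPerformance*) and System.SysUtils
// (the ToString helper for Double) in the unit's uses clause.
procedure TForm1.FormCreate(Sender: TObject);
var
  f, c1, c2: Int64;
  dur1, dur2: Double;
  s1, s2: string;
begin
  QueryPerformanceFrequency(f);   // timer ticks per second

  // Time the repeated + concatenation.
  QueryPerformanceCounter(c1);
  s1 := MakeString1;
  QueryPerformanceCounter(c2);
  dur1 := (c2 - c1) / f;

  // Time the preallocated version.
  QueryPerformanceCounter(c1);
  s2 := MakeString2;
  QueryPerformanceCounter(c2);
  dur2 := (c2 - c1) / f;

  // Durations are in seconds.
  ShowMessage(dur1.ToString + sLineBreak + dur2.ToString);
end;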

On my system, MakeString1 completes in 5 seconds and MakeString2 in 1 second.

Andreas Rejbrand answered Dec 24 '22 19:12