
Delphi XE - RawByteString vs AnsiString

Tags:

unicode

delphi

I asked a similar question here: Delphi XE - should I use String or AnsiString?. After deciding that it is right to use ANSI strings in a (large) library of mine, I realized that I could actually use RawByteString instead of AnsiString. Because I mix Unicode strings with ANSI strings, my code now has quite a few places where it converts between them. However, it looks like using RawByteString would get rid of those conversions.

Please let me know your opinion about it.
Thanks.


Update:
This seems to be disappointing. It looks like the compiler still makes a conversion from RawByteString to string.

procedure TForm1.FormCreate(Sender: TObject);
var x1, x2: RawByteString;
    s: string;
begin
  x1:= 'a';
  x2:= 'b';
  x1:= x1+ x2;
  s:= x1;              {      <------- Implicit string cast from 'RawByteString' to 'string'     }
end;

I think it does some internal work (such as copying data), so my code will not be much faster, and I will still have to add lots of typecasts to silence the compiler.

asked May 19 '11 16:05 by Server Overflow


1 Answer

RawByteString is an AnsiString with no code page set by default.

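When you assign another 8-bit string to a RawByteString variable, the variable takes on the code page of the source string, with no conversion. But assigning it to or from a string (UnicodeString), as in your example, still involves a conversion. Sorry.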

But there is another use of RawByteString, which is to store plain byte content (e.g. the content of a database BLOB field, just like an array of byte).

To summarize:

  • RawByteString should be used as a "code page agnostic" parameter to a method or function;
  • RawByteString can be used as a variable type to store some BLOB data.
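The first use in the summary above can be sketched as follows. This is a minimal illustration for a Unicode compiler (Delphi 2009 or later), not code from the framework: `DescribeBytes` is a hypothetical helper, and the code page reported for the AnsiString depends on the system the program runs on.

```pascal
program RawByteParamDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: the RawByteString parameter accepts any 8-bit
// string as-is, so the bytes and the code page of the argument are kept.
procedure DescribeBytes(const Data: RawByteString);
begin
  Writeln('code page ', StringCodePage(Data), ', ', Length(Data), ' byte(s)');
end;

var
  Wide: string;
  Utf8: UTF8String;
  Ansi: AnsiString;
begin
  Wide := 'h'#$00E9'llo';     // "hello" with an accented e, as a UnicodeString
  Utf8 := UTF8String(Wide);   // explicit conversion to UTF-8
  Ansi := AnsiString(Wide);   // explicit conversion to the system ANSI code page

  DescribeBytes(Utf8);        // passed without conversion
  DescribeBytes(Ansi);        // passed without conversion
end.
```

Only the two explicit casts convert anything; the calls themselves just pass the existing bytes along.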

If you want to reduce conversions, and would rather use 8 bit char strings in your application, you would do better to:

  • Avoid the generic AnsiString type, which depends on the current system code page, and with which you will lose data;
  • Rely on UTF-8 encoding, i.e. an 8 bit encoding which won't lose any data when converted from or to a UnicodeString;
  • Leave the compiler no implicit conversions to warn about: all conversions should be made explicit;
  • Use your own dedicated set of functions to handle your UTF-8 content.
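To make the first three points concrete, here is a small sketch for a Unicode compiler (Delphi 2009 or later). It is only an illustration: the sample text is arbitrary, and whether the ANSI round trip preserves it depends on the system code page (on a Western European system the Greek letter would typically be lost).

```pascal
program Utf8RoundTrip;
{$APPTYPE CONSOLE}

uses
  SysUtils;

var
  Original, Back: string;
  Utf8: UTF8String;
  Ansi: AnsiString;
begin
  // Greek capital omega (U+03A9) followed by ASCII text.
  Original := #$03A9' ohm';

  // Explicit conversions through UTF-8: no data is lost.
  Utf8 := UTF8String(Original);
  Back := UTF8ToString(Utf8);
  Writeln('UTF-8 round trip preserved the text: ', Back = Original);

  // Explicit conversions through the system ANSI code page:
  // the outcome depends on the machine the program runs on.
  Ansi := AnsiString(Original);
  Back := string(Ansi);
  Writeln('ANSI round trip preserved the text: ', Back = Original);
end.
```

Because every conversion is written as a cast or a function call, the compiler has no implicit conversion to warn about.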

That is exactly what we did for our framework. We wanted to use UTF-8 in its kernel because:

  • We rely on UTF-8 encoded JSON for data transmission;
  • Memory consumption will be smaller;
  • The used SQLite3 engine will store text as UTF-8 in its database file;
  • We wanted a way of handling Unicode text with no loss of data with all versions of Delphi (from Delphi 6 up to XE), and WideString was not an option because it's dead slow and you've got the same problem of implicit conversions.

But, in order to achieve the best speed, we wrote some optimized functions to handle our custom string type:

  {{ RawUTF8 is an UTF-8 String stored in an AnsiString
    - use this type instead of System.UTF8String, which behavior changed
     between Delphi 2009 compiler and previous versions: our implementation
     is consistent and compatible with all versions of Delphi compiler
    - mimic Delphi 2009 UTF8String, without the charset conversion overhead
    - all conversion to/from AnsiString or RawUnicode must be explicit }
{$ifdef UNICODE} RawUTF8 = type AnsiString(CP_UTF8); // Codepage for an UTF8string
{$else}          RawUTF8 = type AnsiString; {$endif}

/// our fast RawUTF8 version of Trim(), for Unicode only compiler
// - this Trim() is seldom used, but this RawUTF8 specific version is needed
// by Delphi 2009/2010/XE, to avoid two unnecessary conversions into UnicodeString
function Trim(const S: RawUTF8): RawUTF8;

/// our fast RawUTF8 version of Pos(), for Unicode only compiler
// - this Pos() is seldom used, but this RawUTF8 specific version is needed
// by Delphi 2009/2010/XE, to avoid two unnecessary conversions into UnicodeString
function Pos(const substr, str: RawUTF8): Integer; overload; inline;
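As a rough idea of what such a code-page-specific function can look like, here is a minimal sketch for a Unicode compiler. It is not the framework's implementation: `TrimUtf8` is a hypothetical name, and the type is redeclared locally so the program is self-contained. It works on the bytes directly, which is safe for UTF-8 because every byte of a multi-byte sequence is $80 or above and so can never be mistaken for a space or control character.

```pascal
program TrimUtf8Demo;
{$APPTYPE CONSOLE}

uses
  SysUtils;

type
  // Local redeclaration for this sketch; 65001 is the UTF-8 code page.
  RawUTF8 = type AnsiString(65001);

// Hypothetical byte-level trim: removes leading and trailing bytes
// <= ' ' without going through UnicodeString.
function TrimUtf8(const S: RawUTF8): RawUTF8;
var
  First, Last: Integer;
begin
  First := 1;
  Last := Length(S);
  while (First <= Last) and (S[First] <= ' ') do
    Inc(First);
  while (Last >= First) and (S[Last] <= ' ') do
    Dec(Last);
  Result := Copy(S, First, Last - First + 1);
end;

var
  Wide: string;
  Padded, Trimmed: RawUTF8;
begin
  Wide := '  caf'#$00E9'  ';      // padded text with one accented character
  Padded := RawUTF8(Wide);        // explicit conversion to UTF-8
  Trimmed := TrimUtf8(Padded);
  Writeln('bytes before: ', Length(Padded), ', bytes after: ', Length(Trimmed));
end.
```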

And we reserved the RawByteString type for handling BLOB data:

{$ifndef UNICODE}
  /// define RawByteString, as it does exist in Delphi 2009/2010/XE
  // - to be used for byte storage into an AnsiString
  // - use this type if you don't want the Delphi compiler to do any
  // code page conversions when you assign a typed AnsiString to a RawByteString,
  // i.e. a RawUTF8 or a WinAnsiString
  RawByteString = AnsiString;
  /// pointer to a RawByteString
  PRawByteString = ^RawByteString;
{$endif}

/// create a File from a string content
// - uses RawByteString for byte storage, whatever the codepage is
function FileFromString(const Content: RawByteString; const FileName: TFileName;
  FlushOnDisk: boolean=false): boolean;
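Only the declaration is shown above. As an illustration of the idea, a function of this kind could be sketched with a plain TFileStream as below. This is an assumption about one possible approach, not the framework's code: `SaveBytesToFile` and the file name `demo.bin` are hypothetical, and the FlushOnDisk option is left out.

```pascal
program SaveBytesDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

// Hypothetical helper: writes the bytes of Content to FileName unchanged,
// whatever code page the string carries. Returns False on a stream error.
function SaveBytesToFile(const Content: RawByteString;
  const FileName: TFileName): Boolean;
var
  Stream: TFileStream;
begin
  Result := False;
  try
    Stream := TFileStream.Create(FileName, fmCreate);
    try
      if Length(Content) > 0 then
        Stream.WriteBuffer(Pointer(Content)^, Length(Content));
      Result := True;
    finally
      Stream.Free;
    end;
  except
    on EStreamError do
      Result := False;   // creation or write failed
  end;
end;

var
  Blob: RawByteString;
begin
  // Build three arbitrary bytes without going through a string literal.
  SetLength(Blob, 3);
  Blob[1] := AnsiChar(0);
  Blob[2] := AnsiChar(1);
  Blob[3] := AnsiChar(255);
  Writeln('saved: ', SaveBytesToFile(Blob, 'demo.bin'));
end.
```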

Source code is available in our repository. In this unit, UTF-8 related functions were deeply optimized, with versions in both pascal and asm for better speed. We sometimes overloaded default functions (like Pos) to avoid conversion. More information about how we handled text in the framework is available here.

Last word:

If you are sure that you will only have 7 bit content in your application (no accented characters), you may use the default AnsiString type in your program. But in this case, you would do better to add the AnsiStrings unit to your uses clause, to have overloaded string functions which will avoid most unwanted conversions.
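A minimal sketch of that last suggestion, assuming Delphi 2009 to XE (in later versions the unit is named System.AnsiStrings, so the uses clause may need adjusting). The variable name and the sample text are arbitrary.

```pascal
program AnsiStringsDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils, AnsiStrings;

var
  Key: AnsiString;
begin
  Key := '  user_name  ';        // 7 bit content only
  // With AnsiStrings in the uses clause, these calls resolve to the
  // AnsiString overloads, so Key is not converted to UnicodeString and back.
  Key := UpperCase(Trim(Key));
  Writeln(Key);
end.
```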

answered Dec 07 '22 20:12 by Arnaud Bouchez