Delphi 2009 RawByteString vagaries

Suppose that for some perverse reason you want to display the raw byte contents of a UTF8String.

var
  utf8Str : UTF8String;
begin    
  utf8Str := '€ąćęłńóśźż';
end;

(1) This doesn't work; it displays the readable form:

memo1.Lines.Add( RawByteString( utf8Str ));
// output: '€ąćęłńóśźż'

(2) This, however, does "work" - note the concatenation:

memo1.Lines.Add( 'x' + RawByteString( utf8Str ));
// output: 'xâ‚¬Ä…Ä‡Ä™Å‚Å„Ã³Å›ÅºÅ¼'

I understand (1), though the compiler's forced coercion to UnicodeString seems to prevent ever displaying a RawByteString variable as-is. However, why does the behavior change in (2)?

(3) Stranger still - let's reverse the concatenation:

memo1.Lines.Add( RawByteString( utf8Str ) + 'x' ); 
// output: '€ąćęłńóśźżx'

I've been reading up on the newfangled string types in Delphi and thought I understood how they work, but this is a puzzle.

Marek Jedliński asked Jan 31 '09 05:01


2 Answers

RawByteString only exists to minimize the number of overloads required for functions that work with various flavours of AnsiStrings with different codepage affinities.

In general, don't declare variables of type RawByteString. Don't typecast values to that type. Don't do concatenations on variables of that type. About the only things you can do are:

  • Declaring a parameter of this type (the original intent)
  • Indexing on such a parameter
  • Searching in such a parameter
  • Intelligent operations that check the actual code page of the string, using the StringCodePage function.

For example, you'll note that the StringCodePage function itself uses RawByteString as its argument type. This way, it will work with any AnsiString, rather than doing a codepage translation before passing it as an argument.
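As a rough sketch of that parameter-only usage: the console program below passes two AnsiString flavours to one function and reads the code page inside it. DescribeString is a hypothetical helper invented for this illustration; only StringCodePage, Length and Format are RTL routines.

```pascal
program CodePageDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: because the parameter is a RawByteString, any
// AnsiString flavour is passed through without a code page conversion.
function DescribeString(const S: RawByteString): string;
begin
  Result := Format('%d bytes, code page %d', [Length(S), StringCodePage(S)]);
end;

var
  Utf8: UTF8String;
  Ansi: AnsiString;
begin
  Utf8 := 'abc';
  Ansi := 'abc';
  Writeln(DescribeString(Utf8)); // code page should be reported as 65001 (UTF-8)
  Writeln(DescribeString(Ansi)); // typically the default ANSI code page
end.
```

The function never declares a RawByteString variable, casts to that type, or concatenates one; it only inspects what it was given.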

For your case, things like concatenations are largely undefined. The behaviour changed between RTM and Update 2, but when the RTL string concatenation functions receive multiple strings with different code pages, there's no easy way for them to figure out what code page should be used for the final string. That's just one reason why you shouldn't concatenate them like you do here.

Barry Kelly answered Sep 28 '22 21:09


You cannot add a string to a TMemo "as is". You always need to do some kind of conversion to Unicode, because that's all TMemo knows about in Delphi 2009.

If you want to pretend that your UTF8String uses code page 1252, do this:

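var
  utf8Str : UTF8String;
  Raw: RawByteString;
begin
  utf8Str := '€ąćęłńóśźż';
  Raw := utf8Str;                  // same bytes, still tagged as UTF-8
  SetCodePage(Raw, 1252, False);   // Convert = False: relabel the bytes, don't transcode
  Memo.Lines.Add(Raw);             // converted to Unicode as if the bytes were Windows-1252
end;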

For more details, see my article "Using RawByteString Effectively".
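If the goal is only to inspect the bytes, another option is to skip code pages entirely and print each byte as hex. This is a minimal sketch, not something from the article: BytesToHex is a hypothetical helper, and the program assumes the source file is saved in an encoding that preserves the '€' literal (for example UTF-8 with a BOM).

```pascal
program HexDumpDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: indexes the string byte by byte, so no code page
// conversion is involved at any point.
function BytesToHex(const S: RawByteString): string;
var
  I: Integer;
begin
  Result := '';
  for I := 1 to Length(S) do
  begin
    if I > 1 then
      Result := Result + ' ';
    Result := Result + IntToHex(Ord(S[I]), 2);
  end;
end;

var
  utf8Str: UTF8String;
begin
  utf8Str := '€';
  // U+20AC is encoded in UTF-8 as the three bytes E2 82 AC
  Writeln(BytesToHex(utf8Str));
end.
```

In a VCL application the same helper could feed the memo directly, e.g. Memo.Lines.Add(BytesToHex(utf8Str)), since its result is already a UnicodeString. Unlike relabelling as 1252, this also copes with byte values that Windows-1252 leaves undefined.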

Jan Goyvaerts answered Sep 28 '22 22:09