Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Converting UnicodeString to AnsiString

In the olden times, i had a function that would convert a WideString to an AnsiString of the specified code-page:

function WideStringToString(const Source: WideString; CodePage: UINT): AnsiString;
...
begin
   ...
    // Convert source UTF-16 string (WideString) to the destination using the code-page
    strLen := WideCharToMultiByte(CodePage, 0,
        PWideChar(Source), Length(Source), //Source
        PAnsiChar(cpStr), strLen, //Destination
        nil, nil);
    ...
end;

And everything worked. I passed the function a unicode string (i.e. UTF-16 encoded data) and converted it to an AnsiString, with the understanding that the bytes in the AnsiString represented characters from the specified code-page.

For example:

TUnicodeHelper.WideStringToString('Ŧĥε qùíçķ ƀřǭŵņ fôx ǰűmpεď ōvêŗ ţħě łáƶÿ ďơǥ', 1252);

would return the Windows-1252 encoded string:

The qùíçk brown fôx jumped ovêr the lázÿ dog

Note: Information was of course lost during the conversion from the full Unicode character set to the limited confines of the Windows-1252 code page:

  • Ŧĥε qùíçķ ƀřǭŵņ fôx ǰűmpεď ōvêŗ ţħě łáƶÿ ďơǥ (before)
  • The qùíçk brown fôx jumped ovêr the lázÿ dog (after)

But the Windows WideChartoMultiByte does a pretty good job of best-fit mapping; as it is designed to do.

Now the after times

Now we are in the after times. WideString is now a pariah, with UnicodeString being the goodness. It's an inconsequential change; as the Windows function only needed a pointer to a series of WideChar anyway (which a UnicodeString also is). So we change the declaration to use UnicodeString instead:

funtion WideStringToString(const Source: UnicodeString; CodePage: UINT): AnsiString;
begin
   ...
end;

Now we come to the return value. i have an AnsiString that contains the bytes:

54 68 65 20 71 F9 ED E7  The qùíç
6B 20 62 72 6F 77 6E 20  k brown 
66 F4 78 20 6A 75 6D 70  fôx jump
65 64 20 6F 76 EA 72 20  ed ovêr 
74 68 65 20 6C E1 7A FF  the lázÿ
20 64 6F 67               dog

In the olden times that was fine. I kept track of what code-page the AnsiString actually contained; i had to remember that the returned AnsiString was not encoded using the computer's locale (e.g. Windows 1258), but instead is encoded using another code-page (the CodePage code page).

But in Delphi XE6 an AnsiString also secretly contains the codepage:

  • codePage: 1258
  • length: 44
  • value: The qùíçk brown fôx jumped ovêr the lázÿ dog

This code-page is wrong. Delphi is specifying the code-page of my computer, rather than the code-page that the string is. Technically this is not a problem, i always understood that the AnsiString was in a particular code-page, i just had to be sure to pass that information along.

So when i wanted to decode the string, i had to pass along the code-page with it:

s := TUnicodeHeper.StringToWideString(s, 1252);

with

function StringToWideString(s: AnsiString; CodePage: UINT): UnicodeString;
begin
   ...
   MultiByteToWideChar(...);
   ...
end;

Then one person screws everything up

The problem was that in the olden times i declared a type called Utf8String:

type
   Utf8String = type AnsiString;

Because it was common enough to have:

function TUnicodeHelper.WideStringToUtf8(const s: UnicodeString): Utf8String;
begin
   Result := WideStringToString(s, CP_UTF8);
end;

and the reverse:

function TUnicodeHelper.Utf8ToWideString(const s: Utf8String): UnicodeString;
begin
   Result := StringToWideString(s, CP_UTF8);
end;

Now in XE6 i have a function that takes a Utf8String. If some existing code somewhere were take a UTF-8 encoded AnsiString, and try to convert it to UnicodeString using Utf8ToWideString it would fail:

s: AnsiString;
s := UnicodeStringToString('Ŧĥε qùíçķ ƀřǭŵņ fôx ǰűmpεď ōvêŗ ţħě łáƶÿ ďơǥ', CP_UTF8);

...

 ws: UnicodeString;
 ws := Utf8ToWideString(s); //Delphi will treat s an CP1252, and convert it to UTF8

Or worse, is the breadth of existing code that does:

s: Utf8String;
s := UnicodeStringToString('Ŧĥε qùíçķ ƀřǭŵņ fôx ǰűmpεď ōvêŗ ţħě łáƶÿ ďơǥ', CP_UTF8);

The returned string will become totally mangled:

  • the function returns AnsiString(1252) (AnsiString tagged as encoded using the current codepage)
  • the return result is being stored in an AnsiString(65001) string (Utf8String)
  • Delphi converts the UTF-8 encoded string into UTF-8 as though it was 1252.

How to move forward

Ideally my UnicodeStringToString(string, codePage) function (which returns an AnsiString) could set the CodePage inside the string to match the actual code-page using something like SetCodePage:

function UnicodeStringToString(s: UnicodeString; CodePage: UINT): AnsiString;
begin
   ...
   WideCharToMultiByte(...);
   ...

   //Adjust the codepage contained in the AnsiString to match reality
   //SetCodePage(Result, CodePage, False); SetCodePage only works on RawByteString
   if Length(Result) > 0 then
      PStrRec(PByte(Result) - SizeOf(StrRec)).codePage := CodePage;
end;

Except that manually mucking around with the internal structure of an AnsiString is horribly dangerous.

So what about returning RawByteString?

It has been said, over an over, by a lot of people who aren't me that RawByteString is meant to be the universal recipient; it wasn't meant to be as a return parameter:

function UnicodeStringToString(s: UnicodeString; CodePage: UINT): RawByteString;
begin
   ...
   WideCharToMultiByte(...);
   ...

   //Adjust the codepage contained in the AnsiString to match reality
   SetCodePage(Result, CodePage, False); SetCodePage only works on RawByteString
end;

This has the virtue of being able to use the supported and documented SetCodePage.

But if we're going to cross a line, and start returning RawByteString, surely Delphi already has a function that can convert a UnicodeString to a RawByteString string and vice versa:

function WideStringToString(const s: UnicodeString; CodePage: UINT): RawByteString;
begin
   Result := SysUtils.Something(s, CodePage);
end;

function StringToWideString(const s: RawByteString; CodePage: UINT): UnicodeString;
begin
   Result := SysUtils.SomethingElse(s, CodePage);       
end;

But what is it?

Or what else should i do?

This was a long-winded set of background for a trivial question. The real question is, of course, what should i be doing instead? There is a lot of code out there that depends on the UnicodeStringToString and the reverse.

tl;dr:

I can convert a UnicodeString to UTF by doing:

Utf8Encode('Ŧĥε qùíçķ ƀřǭŵņ fôx ǰűmpεď ōvêŗ ţħě łáƶÿ ďơǥ');

and i can convert a UnicodeString to the current code-page by using:

AnsiString('Ŧĥε qùíçķ ƀřǭŵņ fôx ǰűmpεď ōvêŗ ţħě łáƶÿ ďơǥ');

But how do i convert a UnicodeString to an arbitrary (unspecified) code-page?

My feeling is that since everything really is an AnsiString:

Utf8String = AnsiString(65001);
RawByteString = AnsiString(65535);

i should bite the bullet, bust open the AnsiString structure, and poke the correct code-page into it:

function StringToAnsi(const s: UnicodeString; CodePage: UINT): AnsiString;
begin
   LocaleCharsFromUnicode(CodePage, ..., s, ...);

   ...

   if Length(Result) > 0 then
      PStrRec(PByte(Result) - SizeOf(StrRec)).codePage := CodePage;
end;

Then the rest of the VCL will fall in line.

like image 389
Ian Boyd Avatar asked Nov 12 '14 17:11

Ian Boyd


3 Answers

In this particular case, using RawByteString is an appropriate solution:

function WideStringToString(const Source: UnicodeString; CodePage: UINT): RawByteString;
var
  strLen: Integer;
begin
  strLen := LocaleCharsFromUnicode(CodePage, 0, PWideChar(Source), Length(Source), nil, 0, nil, nil));
  if strLen > 0 then
  begin
    SetLength(Result, strLen);
    LocaleCharsFromUnicode(CodePage, 0, PWideChar(Source), Length(Source), PAnsiChar(Result), strLen, nil, nil));
    SetCodePage(Result, CodePage, False);
  end;
end;

This way, the RawByteString holds the codepage, and assigning the RawByteString to any other string type, whether that be AnsiString or UTF8String or whatever, will allow the RTL to automatically convert the RawByteString data from its current codepage to the destination string's codepage (which includes conversions to UnicodeString).

If you absolutely must return an AnsiString (which I do not recommend), you can still use SetCodePage() via a typecast:

function WideStringToString(const Source: UnicodeString; CodePage: UINT): AnsiString;
var
  strLen: Integer;
begin
  strLen := LocaleCharsFromUnicode(CodePage, 0, PWideChar(Source), Length(Source), nil, 0, nil, nil));
  if strLen > 0 then
  begin
    SetLength(Result, strLen);
    LocaleCharsFromUnicode(CodePage, 0, PWideChar(Source), Length(Source), PAnsiChar(Result), strLen, nil, nil));
    SetCodePage(PRawByteString(@Result)^, CodePage, False);
  end;
end;

The reverse is much easier, just use the codepage already stored in a (Ansi|RawByte)String (just make sure those codepages are always accurate), since the RTL already knows how to retrieve and use the codepage for you:

function StringToWideString(const Source: AnsiString): UnicodeString;
begin
  Result := UnicodeString(Source);
end;

function StringToWideString(const Source: RawByteString): UnicodeString;
begin
  Result := UnicodeString(Source);
end;

That being said, I would suggest dropping the helper functions altogether and just use typed strings instead. Let the RTL handle conversions for you:

type
  Win1252String = type AnsiString(1252);

var
  s: UnicodeString;
  a: Win1252String;
begin
  s := 'Ŧĥε qùíçķ ƀřǭŵņ fôx ǰűmpεď ōvêŗ ţħě łáƶÿ ďơǥ';
  a := Win1252String(s);
  s := UnicodeString(a);
end;

var
  s: UnicodeString;
  u: UTF8String;
begin
  s := 'Ŧĥε qùíçķ ƀřǭŵņ fôx ǰűmpεď ōvêŗ ţħě łáƶÿ ďơǥ';
  u := UTF8String(s);
  s := UnicodeString(u);
end;
like image 143
Remy Lebeau Avatar answered Nov 15 '22 06:11

Remy Lebeau


I think that returning a RawByteString is probably as good as you'll get. You could do it using AnsiString as you outlined but RawByteString captures the intent better. In this scenario a RawByteString morally counts as a parameter in the sense of the official Embarcadero advice. It is just an output rather than an input. The real key is not to use it as a variable.

You could code it like this:

function MBCSString(const s: UnicodeString; CodePage: Word): RawByteString;
var
  enc: TEncoding;
  bytes: TBytes;
begin
  enc := TEncoding.GetEncoding(CodePage);
  try
    bytes := enc.GetBytes(s);
    SetLength(Result, Length(bytes));
    Move(Pointer(bytes)^, Pointer(Result)^, Length(bytes));
    SetCodePage(Result, CodePage, False);
  finally
    enc.Free;
  end;
end;

Then

var
  s: AnsiString;
....
s := MBCSString('Ŧĥε qùíçķ ƀřǭŵņ fôx ǰűmpεď ōvêŗ ţħě łáƶÿ ďơǥ', 1252);
Writeln(StringCodePage(s));
s := MBCSString('Ŧĥε qùíçķ ƀřǭŵņ fôx ǰűmpεď ōvêŗ ţħě łáƶÿ ďơǥ', 1251);
Writeln(StringCodePage(s));
s := MBCSString('Ŧĥε qùíçķ ƀřǭŵņ fôx ǰűmpεď ōvêŗ ţħě łáƶÿ ďơǥ', 65001);
Writeln(StringCodePage(s));

outputs 1252, 1251, and then 65001 as you would expect.

And you could use LocaleCharsFromUnicode if you prefer. Of course, you need to take its documentation with a pinch of salt: LocaleCharsFromUnicode is a wrapper for the WideCharToMultiByte function. Amazing that text was ever written since LocaleCharsFromUnicode surely only exists to be cross-platform.


However, I wonder if you may be making a mistake in attempting to keep ANSI encoded text in AnsiString variables in your program. Normally you would encoded to ANSI as late as possible (at the interop boundary), and likewise decode as early as possible.

If you simply have to do this then perhaps there is a better solution that avoids the dreaded AnsiString completely. Instead of storing the text in an AnsiString, store it in TBytes. You already have data structures that keep track of encoding, so why not keep them. Replace the record that contains code page and AnsiString with one containing code page and TBytes. Then you would have no fear of anything recoding your text behind your back. And your code will be ready to use on the mobile compilers.

like image 26
David Heffernan Avatar answered Nov 15 '22 07:11

David Heffernan


Grovelling through System.pas, i found the built-in function SetAnsiString that does what i want:

procedure SetAnsiString(Dest: _PAnsiStr; Source: PWideChar; Length: Integer; CodePage: Word);

It's also important to note that this function does push the CodePage into the internal StrRec structure for me:

PStrRec(PByte(Dest) - SizeOf(StrRec)).codePage := CodePage;

This allows me to write something like:

function WideStringToString(const s: UnicodeString; DestinationCodePage: Word): AnsiString;
var
   strLen: Integer;
begin
   strLen := Length(Source);

   if strLen = 0 then
   begin
      Result := '';
      Exit;
   end;

   //Delphi XE6 has a function to convert a unicode string to a tagged AnsiString
   SetAnsiString(@Result, @Source[1], strLen, DestinationCodePage);
end;

So when i call:

actual := WideStringToString('Ŧĥε qùíçķ ƀřǭŵņ fôx ǰűmpεď ōvêŗ ţħě łáƶÿ ďơǥ', 850);

i get the resulting AnsiString:

codePage: $0352 (850)
elemSize: $0001 (1)
refCnt:   $00000001 (1)
length:   $0000002C (44)
contents: 'The qùíçk brown fôx jumped ovêr the láZÿ dog' 

An AnsiString with the appropriate code-page already stuffed in the secret codePage member.

The other way

class function TUnicodeHelper.ByteStringToUnicode(const Source: RawByteString; CodePage: UINT): UnicodeString;
var
    wideLen: Integer;
    dw: DWORD;
begin
{
    See http://msdn.microsoft.com/en-us/library/dd317756.aspx
    Code Page Identifiers
    for a list of code pages supported in Windows.

    Some common code pages are:
        CP_UTF8 (65001) utf-8               "Unicode (UTF-8)"
        CP_ACP  (0)                         The system default Windows ANSI code page.
        CP_OEMCP    (1)                         The current system OEM code page.
        1252                    Windows-1252    "ANSI Latin 1; Western European (Windows)", this is what most of us in north america use in Windows
        437                 IBM437          "OEM United States", this is your "DOS fonts"
        850                 ibm850          "OEM Multilingual Latin 1; Western European (DOS)", the format accepted by Fincen for LCTR/STR
        28591                   iso-8859-1      "ISO 8859-1 Latin 1; Western European (ISO)", Windows-1252 is a super-set of iso-8859-1, adding things like euro symbol, bullet and ellipses
        20127                   us-ascii            "US-ASCII (7-bit)"
}
    if Length(Source) = 0 then
    begin
        Result := '';
        Exit;
    end;

    // Determine real size of final, string in symbols
//  wideLen := MultiByteToWideChar(CodePage, 0, PAnsiChar(Source), Length(Source), nil, 0);
    wideLen := UnicodeFromLocaleChars(CodePage, 0, PAnsiChar(Source), Length(Source), nil, 0);
    if wideLen = 0 then
    begin
        dw := GetLastError;
        raise EConvertError.Create('[StringToWideString] Could not get wide length of UTF-16 string. Error '+IntToStr(dw)+' ('+SysErrorMessage(dw)+')');
    end;

    // Allocate memory for UTF-16 string
    SetLength(Result, wideLen);

    // Convert source string to UTF-16 (WideString)
//  wideLen := MultiByteToWideChar(CodePage, 0, PAnsiChar(Source), Length(Source), PWChar(wideStr), wideLen);
    wideLen := UnicodeFromLocaleChars(CodePage, 0, PAnsiChar(Source), Length(Source), PWChar(Result), wideLen);
    if wideLen = 0 then
    begin
        dw := GetLastError;
        raise EConvertError.Create('[StringToWideString] Could not convert string to UTF-16. Error '+IntToStr(dw)+' ('+SysErrorMessage(dw)+')');
    end;
end;

Note: Any code released into public domain. No attribution required.

like image 29
Ian Boyd Avatar answered Nov 15 '22 07:11

Ian Boyd