Delphi: Encoding strings as Python does

I want to encode strings the way Python does.

The Python code is this:

def EncodeToUTF(inputstr):
  uns = inputstr.decode('iso-8859-2')
  utfs = uns.encode('utf-8')
  return utfs

This is very simple.

But in Delphi I don't understand how to do the encoding so that the correct character set is forced first (no matter which computer the code runs on).

I tried this test code to see the conversion:

procedure TForm1.Button1Click(Sender: TObject);
var
    w : WideString;
    buf : array[0..2048] of WideChar;
    i : integer;
    lc : Cardinal;
begin
    lc := GetThreadLocale;
    Caption := IntToStr(lc);
    StringToWideChar(Edit1.Text, buf, SizeOF(buf));
    w := buf;
    lc := MakeLCID(
        MakeLangID( LANG_ENGLISH, SUBLANG_ENGLISH_US),
        0);
    Win32Check(SetThreadLocale(lc));
    Edit2.Text := WideCharToString(PWideChar(w));
    Caption := IntToStr(AnsiCompareText(Edit1.Text, Edit2.Text));
end;

The input is "árvíztűrő tükörfúrógép", the Hungarian accent test phrase. The original lc is 1038 (Hungarian), the new lc is 1033.

But this always gives a result of 0 (the strings are the same), and the accents are unchanged: I don't lose ŐŰ, which do not exist in the English character set.

What am I doing wrong? How do I do the same thing as Python does?

Thanks for any help, links, etc.: dd

durumdara asked Sep 07 '10 11:09


1 Answer

Windows uses codepage 28592 for ISO-8859-2. If you have a buffer containing ISO-8859-2 encoded bytes, then you have to decode the bytes to UTF-16 first, and then encode the result to UTF-8. Depending on which version of Delphi you are using, you can either:

1) on pre-D2009, use MultiByteToWideChar() and WideCharToMultiByte():

function EncodeToUTF(const inputstr: AnsiString): UTF8String;
var
  ret: Integer;
  uns: WideString;
begin
  Result := '';
  if inputstr = '' then Exit;
  // First call: query how many UTF-16 characters the ISO-8859-2 input needs
  ret := MultiByteToWideChar(28592, 0, PAnsiChar(inputstr), Length(inputstr), nil, 0);
  if ret < 1 then Exit;
  SetLength(uns, ret);
  // Second call: decode the ISO-8859-2 bytes into the UTF-16 buffer
  MultiByteToWideChar(28592, 0, PAnsiChar(inputstr), Length(inputstr), PWideChar(uns), Length(uns));
  // Same two-call pattern to encode the UTF-16 text as UTF-8 (codepage 65001)
  ret := WideCharToMultiByte(65001, 0, PWideChar(uns), Length(uns), nil, 0, nil, nil);
  if ret < 1 then Exit;
  SetLength(Result, ret);
  WideCharToMultiByte(65001, 0, PWideChar(uns), Length(uns), PAnsiChar(Result), Length(Result), nil, nil);
end;

2a) on D2009+, use SysUtils.TEncoding.Convert():

function EncodeToUTF(const inputstr: RawByteString): UTF8String;
var
  enc: TEncoding;
  buf: TBytes;
begin
  Result := '';
  if inputstr = '' then Exit;
  enc := TEncoding.GetEncoding(28592);
  try
    buf := TEncoding.Convert(enc, TEncoding.UTF8, BytesOf(inputstr));
    if Length(buf) > 0 then
      SetString(Result, PAnsiChar(@buf[0]), Length(buf));
  finally
    enc.Free;
  end;
end;

2b) on D2009+, alternatively define a new string typedef, put your data into it, and assign it to a UTF8String variable. No manual encoding/decoding needed, the RTL will handle everything for you:

type
  Latin2String = type AnsiString(28592);

var
  inputstr: Latin2String;
  outputstr: UTF8String;
begin
  // put the ISO-8859-2 encoded bytes into inputstr, then...
  outputstr := inputstr;
end;
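
To check that any of the Delphi variants above produces the same bytes as Python, a small reference vector can help. The sketch below is a Python 3 port of the question's function (the original is Python 2, where a byte string has a decode method directly); the name encode_to_utf and the sample bytes are illustrative assumptions, not part of the original code.

```python
def encode_to_utf(input_bytes: bytes) -> bytes:
    """Decode ISO-8859-2 bytes to text, then encode that text as UTF-8."""
    text = input_bytes.decode("iso-8859-2")  # bytes -> Unicode text
    return text.encode("utf-8")              # Unicode text -> UTF-8 bytes


# Reference vector: in ISO-8859-2, 0xF5 is "ő" and 0xFB is "ű".
latin2_sample = b"\xf5\xfb"
utf8_sample = encode_to_utf(latin2_sample)
print(utf8_sample.hex())  # c591c5b1
```

Feeding the same two input bytes (F5 FB) to the Delphi EncodeToUTF should yield the four bytes C5 91 C5 B1 if the conversion goes through codepage 28592 as intended.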
Remy Lebeau answered Nov 17 '22 21:11