
Unicode text file output differs between XE2 and Delphi 2009?

When I run the code below, the output in XE2 differs from the output in D2009.

procedure TForm1.Button1Click(Sender: TObject);
var Outfile:textfile;
    myByte: Byte;

begin
  assignfile(Outfile,'test_chinese.txt');
  Rewrite(Outfile);

  for myByte in TEncoding.UTF8.GetPreamble do write(Outfile, AnsiChar(myByte));
  //This is the UTF-8 BOM

  Writeln(Outfile,utf8string('总结'));
  Writeln(Outfile,'°C');
  Closefile(Outfile);
end;

Compiling with XE2 on a Windows 8 PC gives this in WordPad:

?? C

txt hex code: EF BB BF 3F 3F 0D 0A B0 43 0D 0A

Compiling with D2009 on a Windows XP PC gives this in WordPad:

总结 °C

txt hex code: EF BB BF E6 80 BB E7 BB 93 0D 0A B0 43 0D 0A

My question is: why does it differ, and how can I save Chinese characters to a text file using the old text file I/O?

Thanks!

Thomas asked Jan 09 '13 10:01


2 Answers

In XE2 onwards, AssignFile() has an optional CodePage parameter that sets the codepage of the output file:

function AssignFile(var F: File; FileName: String; [CodePage: Word]): Integer; overload;

Write() and Writeln() both have overloads that support UnicodeString and WideChar inputs.

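So you can create a file whose codepage is set to CP_UTF8, and Write()/Writeln() will then automatically convert Unicode strings to UTF-8 when writing them to the file. Without the CodePage argument the file presumably uses the system's default ANSI codepage, so XE2 converts the UTF8String to that codepage on output, which would explain the 3F 3F ('??') bytes in your XE2 dump.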

The downside is that you can no longer write the UTF-8 BOM as individual AnsiChar values, because each byte gets converted to UTF-8 on its own and the BOM is therefore not written correctly. You can get around that by writing the BOM as a single Unicode character (which is what it really is: U+FEFF) instead of as individual bytes.

This works in XE2:

procedure TForm1.Button1Click(Sender: TObject);
var
  Outfile: TextFile;
begin
  AssignFile(Outfile, 'test_chinese.txt', CP_UTF8);
  Rewrite(Outfile);

  //This is the UTF-8 BOM
  Write(Outfile, #$FEFF);

  Writeln(Outfile, '总结');
  Writeln(Outfile, '°C');
  CloseFile(Outfile);
end;

With that said, if you want something that is more compatible and reliable between D2009 and XE2, use TStreamWriter instead:

procedure TForm1.Button1Click(Sender: TObject);
var
  Outfile: TStreamWriter;
begin
  Outfile := TStreamWriter.Create('test_chinese.txt', False, TEncoding.UTF8);
  try
    Outfile.WriteLine('总结');
    Outfile.WriteLine('°C');
  finally
    Outfile.Free;
  end;
end;

Or do the file I/O manually:

procedure TForm1.Button1Click(Sender: TObject);
var
  Outfile: TFileStream;
  BOM: TBytes;

  procedure WriteBytes(const B: TBytes);
  begin
    // TBytes is a dynamic array, so test its length rather than comparing it to ''
    if Length(B) > 0 then Outfile.WriteBuffer(B[0], Length(B));
  end;

  procedure WriteStr(const S: UTF8String);
  begin
    if S <> '' then Outfile.WriteBuffer(S[1], Length(S));
  end;

  procedure WriteLine(const S: UTF8String);
  begin
    WriteStr(S);
    WriteStr(sLineBreak);
  end;

begin
  Outfile := TFileStream.Create('test_chinese.txt', fmCreate);
  try
    WriteBytes(TEncoding.UTF8.GetPreamble);
    WriteLine('总结');
    WriteLine('°C');
  finally
    Outfile.Free;
  end;
end;
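
To check what any of the variants above actually wrote, the file can be dumped as hex and compared with the dumps in the question. This is only a sketch: BytesToHex and FileToHex are hypothetical helper names, not RTL routines, and the file is assumed to be small enough to load into memory in one go.

```delphi
uses
  SysUtils, Classes;

// Hypothetical helper: formats a byte array as space-separated hex pairs.
function BytesToHex(const Bytes: TBytes): string;
var
  I: Integer;
begin
  Result := '';
  for I := 0 to High(Bytes) do
  begin
    if I > 0 then
      Result := Result + ' ';
    Result := Result + IntToHex(Bytes[I], 2);
  end;
end;

// Hypothetical helper: reads a whole file into memory and returns its hex dump.
function FileToHex(const FileName: string): string;
var
  Stream: TFileStream;
  Bytes: TBytes;
begin
  Stream := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    SetLength(Bytes, Stream.Size);
    if Length(Bytes) > 0 then
      Stream.ReadBuffer(Bytes[0], Length(Bytes));
  finally
    Stream.Free;
  end;
  Result := BytesToHex(Bytes);
end;
```

For example, call ShowMessage(FileToHex('test_chinese.txt')) after running one of the procedures. In a file that is UTF-8 throughout, '°' (U+00B0) is encoded as C2 B0, so the second line should read C2 B0 43 rather than the B0 43 seen in both dumps in the question.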
Remy Lebeau answered Nov 15 '22 22:11


You really shouldn't use the old text I/O anymore.

Anyway, you can use TEncoding to get the UTF-8 TBytes like this:

procedure TForm1.Button1Click(Sender: TObject);
var Outfile:textfile;
    Bytes: TBytes;
    myByte: Byte;
begin
  assignfile(Outfile,'test_chinese.txt');
  Rewrite(Outfile);

  for myByte in TEncoding.UTF8.GetPreamble do write(Outfile, AnsiChar(myByte));
  //This is the UTF-8 BOM

  Bytes := TEncoding.UTF8.GetBytes('总结');
  for myByte in Bytes do begin
    Write(Outfile, AnsiChar(myByte));
  end;

  Writeln(Outfile,'°C');
  Closefile(Outfile);
end;

I'm not sure whether there is an easier way to write TBytes to a TextFile; maybe somebody else has a better idea.

Edit:

For a pure binary file (File instead of TextFile type) you can use BlockWrite.
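
A minimal sketch of the BlockWrite approach, assuming the whole file is produced in one call. WriteUtf8LineBinary is a hypothetical helper name; AssignFile, Rewrite, BlockWrite, CloseFile and TEncoding are the standard RTL routines.

```delphi
uses
  SysUtils;

// Hypothetical helper: writes the UTF-8 BOM, then S plus a line break encoded
// as UTF-8, to an untyped file. Returns the number of bytes written.
function WriteUtf8LineBinary(const FileName, S: string): Integer;
var
  F: file;
  Bytes: TBytes;
begin
  Result := 0;
  AssignFile(F, FileName);
  Rewrite(F, 1); // record size 1, so BlockWrite counts are in bytes
  try
    // BOM first
    Bytes := TEncoding.UTF8.GetPreamble;
    if Length(Bytes) > 0 then
    begin
      BlockWrite(F, Bytes[0], Length(Bytes));
      Inc(Result, Length(Bytes));
    end;
    // Then the text itself, converted from UnicodeString to UTF-8 bytes
    Bytes := TEncoding.UTF8.GetBytes(S + sLineBreak);
    if Length(Bytes) > 0 then
    begin
      BlockWrite(F, Bytes[0], Length(Bytes));
      Inc(Result, Length(Bytes));
    end;
  finally
    CloseFile(F);
  end;
end;
```

Calling WriteUtf8LineBinary('test_chinese.txt', '总结') recreates the file with the BOM and a single line. To write several lines, keep the file open and repeat only the second BlockWrite for each line.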

Jens Mühlenhoff answered Nov 15 '22 22:11