
TFile.ReadAllText with TEncoding.UTF8 omits first 3 chars

I have a UTF-8 text file which starts with this line:

<HEAD><META name=GENERATOR content="MSHTML 10.00.9200.16521"><body>

When I read this file with TFile.ReadAllText with TEncoding.UTF8:

MyStr := TFile.ReadAllText(ThisFileNamePath, TEncoding.UTF8);

then the first 3 characters of the text file are omitted, so MyStr results in:

'AD><META name=GENERATOR content="MSHTML 10.00.9200.16521"><body>...'

However, when I read this file with TFile.ReadAllText without TEncoding.UTF8:

MyStr := TFile.ReadAllText(ThisFileNamePath);

then the file is read completely and correctly:

<HEAD><META name=GENERATOR content="MSHTML 10.00.9200.16521"><body>...

Does TFile.ReadAllText have a bug?

user1580348 asked Jun 16 '13 12:06


1 Answer

The first three bytes are skipped because the RTL code assumes that the file contains a UTF-8 BOM. Clearly your file does not.

The TUTF8Encoding class implements a GetPreamble method that returns the UTF-8 BOM (the three bytes EF BB BF). ReadAllText then skips the preamble specified by the encoding that you pass, whether or not the file actually begins with it.
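To confirm that a given file really lacks the BOM, you can compare its leading bytes against the encoding's preamble. The following is a minimal sketch; StartsWithPreamble is a hypothetical helper name, not an RTL function, and the System.* unit scope names assume a reasonably recent Delphi version.

```pascal
uses
  System.SysUtils;

// Hypothetical helper (not part of the RTL): returns True when Bytes
// begins with the preamble reported by Encoding, for example
// EF BB BF for TEncoding.UTF8.
function StartsWithPreamble(const Bytes: TBytes; Encoding: TEncoding): Boolean;
var
  Preamble: TBytes;
begin
  Preamble := Encoding.GetPreamble;
  // Short-circuit evaluation keeps Bytes[0] from being touched
  // when the buffer is shorter than the preamble.
  Result := (Length(Preamble) > 0) and
    (Length(Bytes) >= Length(Preamble)) and
    CompareMem(@Bytes[0], @Preamble[0], Length(Preamble));
end;
```

For the file in the question, whose first bytes are the characters "<HE", this check would return False for TEncoding.UTF8, which matches the observed loss of exactly those three characters.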

One simple solution would be to read the file into a byte array and then use TEncoding.UTF8.GetString to decode it into a string.

var
  Bytes: TBytes;
  Str: string;
....
Bytes := TFile.ReadAllBytes(FileName);
Str := TEncoding.UTF8.GetString(Bytes);
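Note that the snippet above decodes every byte, so a file that does start with a BOM would yield a string beginning with the character U+FEFF. If you need to handle both kinds of file, one option is to skip the preamble only when it is present. This is a sketch under that assumption; ReadUtf8File is a hypothetical helper name, not an RTL function.

```pascal
uses
  System.SysUtils, System.IOUtils;

// Hypothetical helper (not part of the RTL): reads a UTF-8 file and
// drops the BOM only when the file really starts with one.
function ReadUtf8File(const FileName: string): string;
var
  Bytes, Preamble: TBytes;
  Offset: Integer;
begin
  Bytes := TFile.ReadAllBytes(FileName);
  Preamble := TEncoding.UTF8.GetPreamble;  // EF BB BF
  Offset := 0;
  if (Length(Preamble) > 0) and (Length(Bytes) >= Length(Preamble)) and
     CompareMem(@Bytes[0], @Preamble[0], Length(Preamble)) then
    Offset := Length(Preamble);
  // Nothing left to decode: empty file, or a file holding only a BOM.
  if Length(Bytes) = Offset then
    Exit('');
  Result := TEncoding.UTF8.GetString(Bytes, Offset, Length(Bytes) - Offset);
end;
```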

A more comprehensive alternative would be to create a TEncoding descendant that reports no UTF-8 BOM.

type
  TUTF8EncodingWithoutBOM = class(TUTF8Encoding)
  public
    function Clone: TEncoding; override;
    function GetPreamble: TBytes; override;
  end;

function TUTF8EncodingWithoutBOM.Clone: TEncoding;
begin
  Result := TUTF8EncodingWithoutBOM.Create;
end;

function TUTF8EncodingWithoutBOM.GetPreamble: TBytes;
begin
  Result := nil;
end;

Instantiate one of these (you only need one instance per process) and pass it to TFile.ReadAllText.

The advantage of using a singleton instance of TUTF8EncodingWithoutBOM is that you can use it anywhere that expects a TEncoding.
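One way to arrange the single instance is to wrap the class in a small unit that creates it at startup and frees it at shutdown. This is only a sketch: the unit name Utf8NoBom and the accessor UTF8NoBOM are hypothetical names chosen for illustration, and the System.* unit scope names assume a reasonably recent Delphi version.

```pascal
unit Utf8NoBom;  // hypothetical unit name

interface

uses
  System.SysUtils;

type
  TUTF8EncodingWithoutBOM = class(TUTF8Encoding)
  public
    function Clone: TEncoding; override;
    function GetPreamble: TBytes; override;
  end;

// Hypothetical accessor for the shared instance; callers must not free it.
function UTF8NoBOM: TEncoding;

implementation

var
  Instance: TEncoding;

function TUTF8EncodingWithoutBOM.Clone: TEncoding;
begin
  Result := TUTF8EncodingWithoutBOM.Create;
end;

function TUTF8EncodingWithoutBOM.GetPreamble: TBytes;
begin
  Result := nil;  // report an empty preamble, so nothing is skipped
end;

function UTF8NoBOM: TEncoding;
begin
  Result := Instance;
end;

initialization
  // Created once per process, before any other code in the unit runs.
  Instance := TUTF8EncodingWithoutBOM.Create;

finalization
  Instance.Free;

end.
```

With System.IOUtils in the uses clause, the call from the question would then become MyStr := TFile.ReadAllText(ThisFileNamePath, UTF8NoBOM);. Because this encoding never reports a preamble, a file that does contain a BOM would decode with a leading U+FEFF character.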

David Heffernan answered Sep 28 '22 00:09