I have a UTF-8 text file which starts with this line:
<HEAD><META name=GENERATOR content="MSHTML 10.00.9200.16521"><body>
When I read this file with TFile.ReadAllText with TEncoding.UTF8:
MyStr := TFile.ReadAllText(ThisFileNamePath, TEncoding.UTF8);
then the first 3 characters of the text file are omitted, so MyStr results in:
'AD><META name=GENERATOR content="MSHTML 10.00.9200.16521"><body>...'
However, when I read this file with TFile.ReadAllText without TEncoding.UTF8:
MyStr := TFile.ReadAllText(ThisFileNamePath);
then the file is read completely and correctly:
<HEAD><META name=GENERATOR content="MSHTML 10.00.9200.16521"><body>...
Does TFile.ReadAllText have a bug?
The first three bytes are skipped because the RTL code assumes that the file contains a UTF-8 BOM. Clearly your file does not.
The TUTF8Encoding class implements a GetPreamble method that specifies the UTF-8 BOM, and ReadAllText skips the preamble specified by the encoding that you pass.
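To see what is being skipped, a minimal console sketch can print the preamble that TEncoding.UTF8 reports. The program name is made up for illustration, and the unit-scoped name System.SysUtils assumes Delphi XE2 or later.

```delphi
program ShowUtf8Preamble;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

var
  Preamble: TBytes;
  B: Byte;
begin
  // GetPreamble returns the byte sequence the encoding uses as its BOM.
  Preamble := TEncoding.UTF8.GetPreamble;
  Writeln('Preamble length: ', Length(Preamble));

  // Print each preamble byte as two hex digits.
  for B in Preamble do
    Write(IntToHex(B, 2), ' ');
  Writeln;
end.
```

The UTF-8 BOM is the three bytes EF BB BF, which matches the three characters that go missing from the start of the file in the question.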
One simple solution would be to read the file into a byte array and then use TEncoding.UTF8.GetString to decode it into a string.
var
  Bytes: TBytes;
  Str: string;
....
Bytes := TFile.ReadAllBytes(FileName);
Str := TEncoding.UTF8.GetString(Bytes);
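If some of your files might carry a BOM and others not, the byte-array approach can be extended to check for the preamble before decoding. The following is only a sketch: ReadUtf8Text is a hypothetical helper name, not an RTL function, and the file name is assumed to arrive as the first command-line argument.

```delphi
program ReadUtf8OptionalBom;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.IOUtils;

// Hypothetical helper (not part of the RTL): decodes a UTF-8 file and
// skips the BOM only when the file actually starts with one.
function ReadUtf8Text(const FileName: string): string;
var
  Bytes, Preamble: TBytes;
  Offset, Count: Integer;
begin
  Bytes := TFile.ReadAllBytes(FileName);
  Preamble := TEncoding.UTF8.GetPreamble;

  // Compare the leading bytes of the file with the preamble.
  Offset := 0;
  if (Length(Preamble) > 0) and (Length(Bytes) >= Length(Preamble)) and
     CompareMem(@Bytes[0], @Preamble[0], Length(Preamble)) then
    Offset := Length(Preamble);

  // Avoid indexing past the end of the array when nothing is left to decode.
  Count := Length(Bytes) - Offset;
  if Count = 0 then
    Result := ''
  else
    Result := TEncoding.UTF8.GetString(Bytes, Offset, Count);
end;

begin
  // The file name is taken from the first command-line argument.
  Writeln(Length(ReadUtf8Text(ParamStr(1))), ' characters read');
end.
```

The plain GetString(Bytes) call decodes every byte it is given, so without a check like this a file that does start with a BOM would most likely end up with a leading U+FEFF character in the string.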
A more comprehensive alternative would be to make a TEncoding instance that ignores the UTF-8 BOM.
type
  TUTF8EncodingWithoutBOM = class(TUTF8Encoding)
  public
    function Clone: TEncoding; override;
    function GetPreamble: TBytes; override;
  end;

function TUTF8EncodingWithoutBOM.Clone: TEncoding;
begin
  Result := TUTF8EncodingWithoutBOM.Create;
end;

function TUTF8EncodingWithoutBOM.GetPreamble: TBytes;
begin
  Result := nil;
end;
Instantiate one of these (you only need one instance per process) and pass it to TFile.ReadAllText.
The advantage of using a singleton instance of TUTF8EncodingWithoutBOM is that you can use it anywhere that expects a TEncoding.
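Putting the pieces together, a small console program might look like the sketch below. The program name and the Utf8NoBom and Content variables are made up for illustration, the file name is assumed to be the first command-line argument, and the instance is owned and freed by the program itself.

```delphi
program ReadWithoutBom;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.IOUtils;

type
  // UTF-8 encoding that reports an empty preamble (no BOM).
  TUTF8EncodingWithoutBOM = class(TUTF8Encoding)
  public
    function Clone: TEncoding; override;
    function GetPreamble: TBytes; override;
  end;

function TUTF8EncodingWithoutBOM.Clone: TEncoding;
begin
  Result := TUTF8EncodingWithoutBOM.Create;
end;

function TUTF8EncodingWithoutBOM.GetPreamble: TBytes;
begin
  Result := nil;
end;

var
  Utf8NoBom: TEncoding;  // hypothetical name for the single shared instance
  Content: string;
begin
  Utf8NoBom := TUTF8EncodingWithoutBOM.Create;
  try
    // With an empty preamble there are no leading bytes for ReadAllText to skip.
    Content := TFile.ReadAllText(ParamStr(1), Utf8NoBom);
    Writeln(Length(Content), ' characters read');
  finally
    Utf8NoBom.Free;
  end;
end.
```

In a larger application you would probably create the instance once, for example in a unit's initialization section, and free it in the finalization section. Note that a file which does begin with a BOM would likely keep it as a leading U+FEFF character when read this way, so this encoding suits files you know to be BOM-less.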