Is there an easy way to work around a Delphi utf8-file flaw?

I have discovered (the hard way) that if a file has a valid UTF-8 BOM but contains any invalid UTF-8 byte sequences, and it is read by any of the encoding-aware Delphi (2009+) methods such as LoadFromFile, the result is completely empty, with no error indication. In several of my applications I would prefer simply to lose the few bad sequences, even if I get no error report in that case either.

Debugging reveals that MultiByteToWideChar is called twice: first to get the output buffer size, then to do the conversion. TEncoding.UTF8 holds a private FMBToWCharFlags value that is passed to both calls, and it is initialized to MB_ERR_INVALID_CHARS. So the call that gets the character count returns 0 and the loaded text is completely empty. Calling this API without the flag would 'silently drop illegal code points'.
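To make the two-call pattern concrete, here is a minimal console sketch that calls MultiByteToWideChar directly, once with MB_ERR_INVALID_CHARS and once with no flags. DecodeUtf8 is a hypothetical helper written for this illustration; it is not the RTL's code. What the lenient call does with the bad byte depends on the Windows version: as far as I understand Microsoft's documentation, older systems drop it and Vista and later substitute U+FFFD, so the sketch prints lengths without asserting a particular value.

```delphi
program Utf8FlagDemo;

{$APPTYPE CONSOLE}

uses
  Windows, SysUtils; // Winapi.Windows / System.SysUtils with unit scope names

// Hypothetical helper: decodes UTF-8 bytes using the given
// MultiByteToWideChar flags and returns '' if the API reports failure.
function DecodeUtf8(const Bytes: TBytes; Flags: DWORD): string;
var
  CharCount: Integer;
begin
  Result := '';
  if Length(Bytes) = 0 then
    Exit;
  // First call: ask for the required buffer size in UTF-16 code units.
  CharCount := MultiByteToWideChar(CP_UTF8, Flags, PAnsiChar(@Bytes[0]),
    Length(Bytes), nil, 0);
  if CharCount = 0 then
    Exit; // failure, e.g. invalid input while MB_ERR_INVALID_CHARS is set
  SetLength(Result, CharCount);
  // Second call: perform the actual conversion into the buffer.
  MultiByteToWideChar(CP_UTF8, Flags, PAnsiChar(@Bytes[0]),
    Length(Bytes), PChar(Result), CharCount);
end;

var
  Bad: TBytes;
begin
  // 'A', a lone $FF byte (never valid in UTF-8), 'B'
  Bad := TBytes.Create($41, $FF, $42);
  Writeln('strict length:  ', Length(DecodeUtf8(Bad, MB_ERR_INVALID_CHARS)));
  Writeln('lenient length: ', Length(DecodeUtf8(Bad, 0)));
end.
```

The strict call is the one that mirrors the empty result described above: when the size query returns 0, there is nothing left to convert.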

My question is how best to weave through the nest of classes in the encoding area to work around the fact that this is a private value (and needs to be, because it is a class var for all threads). I think I could add a custom UTF-8 encoding, using the guidance in Marco Cantu's Delphi 2009 book. It could optionally raise an exception if MultiByteToWideChar returned an encoding error, after calling it again without the flag. But that does not solve the problem of how to get my custom encoding used instead of TEncoding.UTF8.

If I could just set this up as a default for the application at initialization, perhaps by actually modifying the class var for TEncoding.UTF8, this would probably be sufficient.

Of course, I need a solution without waiting to lodge a QC report asking for a more robust design, getting it accepted, and seeing it changed.

Any ideas would be very welcome. And can someone confirm this is still an issue for XE4, which I have not yet installed?

asked May 13 '13 23:05

frogb

2 Answers

I ran into the MB_ERR_INVALID_CHARS issue when I first updated Indy to support TEncoding, and ended up implementing a custom TEncoding-derived class for UTF-8 handling to avoid specifying MB_ERR_INVALID_CHARS. I didn't think to use a class helper.

However, this issue is not just limited to UTF-8. Any decoding failure of any of the TEncoding classes will result in a blank result, not an exception being raised. Why Embarcadero chose that route, when most of the RTL/VCL uses exceptions instead, is beyond me. Not raising an exception on error caused a fair amount of issues in Indy that had to be worked around.
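One way to turn the silent blank result into an error is to decode the bytes yourself and check whether a non-empty payload came back empty. The sketch below illustrates that idea only; it is not how Indy handles it. LoadTextOrRaise is a hypothetical name, and raising EEncodingError is my own choice for the exception class.

```delphi
program StrictLoadDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

// Hypothetical helper: reads a file, detects the encoding from its BOM,
// and raises if a non-empty payload decodes to an empty string.
function LoadTextOrRaise(const FileName: string): string;
var
  Stream: TFileStream;
  Bytes: TBytes;
  Encoding: TEncoding;
  Size, BomLen: Integer;
begin
  Stream := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    Size := Stream.Size;
    SetLength(Bytes, Size);
    if Size > 0 then
      Stream.ReadBuffer(Bytes[0], Size);
  finally
    Stream.Free;
  end;

  Encoding := nil; // nil asks GetBufferEncoding to detect from the BOM
  BomLen := TEncoding.GetBufferEncoding(Bytes, Encoding);
  if Size = BomLen then
    Exit(''); // nothing after the BOM: genuinely empty text

  Result := Encoding.GetString(Bytes, BomLen, Size - BomLen);
  if Result = '' then
    raise EEncodingError.CreateFmt(
      'Could not decode "%s" as %s', [FileName, Encoding.ClassName]);
end;

begin
  try
    Writeln(Length(LoadTextOrRaise(ParamStr(1))), ' characters decoded');
  except
    on E: Exception do
      Writeln(E.ClassName, ': ', E.Message);
  end;
end.
```

The check relies on the behaviour described above, namely that a failed decode yields an empty string even though there were bytes to decode.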

answered Oct 21 '22 01:10

Remy Lebeau

This can be done pretty simply, at least in Delphi XE5 (have not checked earlier versions). Just instantiate your own TUTF8Encoding:

procedure LoadInvalidUTF8File(const Filename: string);
var
  Encoding: TUTF8Encoding;
  Lines: TStringList;
begin
  // Pass 0 as the MultiByteToWideChar flags. The stock encoding is
  // created with (CP_UTF8, MB_ERR_INVALID_CHARS, 0) instead.
  Encoding := TUTF8Encoding.Create(CP_UTF8, 0, 0);
  try
    Lines := TStringList.Create;
    try
      Lines.LoadFromFile(Filename, Encoding);
      // ...
    finally
      Lines.Free;
    end;
  finally
    Encoding.Free;
  end;
end;

The only issue here is that the IsSingleByte property for the newly instantiated TUTF8Encoding is then incorrectly set to False, but this property is not currently used anywhere in the Delphi sources.
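If the lenient encoding is needed in several places, the same constructor call can be wrapped in a small descendant class. This is only a sketch. TLenientUTF8Encoding and LoadLenient are hypothetical names, and it assumes the three-argument constructor is reachable from a descendant in your Delphi version, as it is when called directly in XE5.

```delphi
unit LenientUtf8;

interface

uses
  Windows, SysUtils, Classes;

type
  // Hypothetical descendant: UTF-8 decoding without MB_ERR_INVALID_CHARS.
  TLenientUTF8Encoding = class(TUTF8Encoding)
  public
    constructor Create; override;
  end;

// Hypothetical convenience wrapper around TStrings.LoadFromFile.
procedure LoadLenient(Lines: TStrings; const FileName: string);

implementation

constructor TLenientUTF8Encoding.Create;
begin
  // Same call as in the procedure above: flags of 0 for both directions.
  inherited Create(CP_UTF8, 0, 0);
end;

procedure LoadLenient(Lines: TStrings; const FileName: string);
var
  Encoding: TEncoding;
begin
  Encoding := TLenientUTF8Encoding.Create;
  try
    Lines.LoadFromFile(FileName, Encoding);
  finally
    Encoding.Free; // the caller owns a custom encoding instance
  end;
end;

end.
```

Code that does not pass an encoding explicitly still goes through TEncoding.UTF8, so this does not change the application-wide default the question asks about.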

answered Oct 21 '22 00:10

Marc Durdin

Tags: utf-8, delphi