
TStringList behavior with non-ANSI files

In my application, I use TStringList when I want to import a file.

But when someone exports data from Excel, the file encoding is UCS-2 Little Endian, and TStringList can't read the data.

Is there any way to detect this situation, identify the text encoding, and warn the user that the text provided is not compatible?

Just to be clear, the user should provide only plain text (letters and numbers). For anything else, I must show the warning.

A Unicode file without a BOM is fine. (TStringList can read it!)
An ANSI file is fine too. (TStringList can read it!)
Even a Unicode file with a BOM would be fine, if there is a way to remove the BOM. (TStringList can read it, but the text starts with the characters "ï", "»" and "¿", which are the BOM bytes.)

EProgrammerNotFound asked Apr 26 '13 15:04



1 Answer

I used the following function in Delphi 6 to detect Unicode BOMs.

const
  // Requires the units Classes (TFileStream) and SysUtils (CompareMem, FreeAndNil, fmOpenRead).
  // Standard byte order marks (BOMs)
  UTF8BOM:              array [0..2] of AnsiChar = #$EF#$BB#$BF;
  UTF16LittleEndianBOM: array [0..1] of AnsiChar = #$FF#$FE;
  UTF16BigEndianBOM:    array [0..1] of AnsiChar = #$FE#$FF;
  UTF32LittleEndianBOM: array [0..3] of AnsiChar = #$FF#$FE#$00#$00;
  UTF32BigEndianBOM:    array [0..3] of AnsiChar = #$00#$00#$FE#$FF;

function FileHasUnicodeBOM(const FileName: string): Boolean;
var
  Buffer: array [0..3] of AnsiChar;
  Stream: TFileStream;
begin
  // fmShareDenyWrite allows other programs read access at the same time.
  Stream := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    // Fill the buffer with a byte value that no BOM contains, then...
    FillChar(Buffer, SizeOf(Buffer), $AA);
    // ...read up to SizeOf(Buffer) bytes; the file may be shorter than that.
    // Use Read rather than ReadBuffer so that no exception is raised if we can't fill Buffer.
    Stream.Read(Buffer, SizeOf(Buffer));
  finally
    FreeAndNil(Stream);
  end;
  Result := CompareMem(@UTF8BOM,              @Buffer, SizeOf(UTF8BOM))              or
            CompareMem(@UTF16LittleEndianBOM, @Buffer, SizeOf(UTF16LittleEndianBOM)) or
            CompareMem(@UTF16BigEndianBOM,    @Buffer, SizeOf(UTF16BigEndianBOM))    or
            CompareMem(@UTF32LittleEndianBOM, @Buffer, SizeOf(UTF32LittleEndianBOM)) or
            CompareMem(@UTF32BigEndianBOM,    @Buffer, SizeOf(UTF32BigEndianBOM));
end;

This will detect all the standard BOMs. You could use it to block such files if that's the behaviour you want.
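
Since the question also asks to identify the encoding, here is a minimal sketch that names the BOM instead of only returning a Boolean. DetectBOM is a hypothetical helper, not an RTL function, and it assumes Delphi 6, where string literals and AnsiString are single-byte. Raw is expected to hold the first bytes of the file, for example the four bytes read into Buffer above.

```pascal
uses
  SysUtils; // CompareMem

// Hypothetical helper: returns the name of the encoding whose BOM starts Raw,
// or '' when Raw does not start with any of the standard BOMs.
function DetectBOM(const Raw: AnsiString): string;

  // True when Raw begins with the bytes in BOM.
  function StartsWith(const BOM: AnsiString): Boolean;
  begin
    Result := (Length(Raw) >= Length(BOM)) and
              CompareMem(Pointer(Raw), Pointer(BOM), Length(BOM));
  end;

begin
  // The UTF-32 LE BOM begins with the UTF-16 LE BOM, so test the longer one first.
  if StartsWith(#$FF#$FE#$00#$00) then
    Result := 'UTF-32 LE'
  else if StartsWith(#$00#$00#$FE#$FF) then
    Result := 'UTF-32 BE'
  else if StartsWith(#$EF#$BB#$BF) then
    Result := 'UTF-8'
  else if StartsWith(#$FF#$FE) then
    Result := 'UTF-16 LE'
  else if StartsWith(#$FE#$FF) then
    Result := 'UTF-16 BE'
  else
    Result := '';
end;
```

The returned name could go straight into the warning shown to the user. Note that a UTF-16 LE file whose first character is #0 cannot be told apart from UTF-32 LE by its BOM alone.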

You state that Delphi 6 TStringList can load 16-bit encoded files if they do not have a BOM. Whilst that may be the case, you will find that, for characters in the ASCII range, every other character is #0, which I guess is not what you want.

If you want to detect that text is Unicode for files without BOMs then you could use IsTextUnicode. However, it may give false positives. This is a situation where I suspect it is better to ask for forgiveness than permission.
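
A minimal sketch of such a check, assuming the Windows unit declares IsTextUnicode with a pointer, a byte count and an optional pointer to the test flags, as the Win32 API does. LooksLikeUTF16 is a hypothetical name, and Raw is assumed to hold the first few kilobytes of the file as raw bytes.

```pascal
uses
  Windows; // IsTextUnicode

// Hypothetical helper: asks Windows whether Raw is likely to be UTF-16 text.
// The answer is a statistical guess, so treat True as "probably", not "certainly".
function LooksLikeUTF16(const Raw: AnsiString): Boolean;
begin
  if Raw = '' then
    Result := False // nothing to inspect
  else
    // Passing nil as the last parameter lets IsTextUnicode run all of its tests.
    Result := IsTextUnicode(Pointer(Raw), Length(Raw), nil);
end;
```

Because of the false positives, it is probably safer to use the result to word a warning than to reject the file outright.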

Now, if I were you I would not actually try to block Unicode files. I would read them. Use the TNT Unicode library. The class you want is called TWideStringList.
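
If adding a third-party library is not an option, the same idea can be sketched with the RTL alone for the UCS-2 Little Endian files that Excel produces. This is not how the Tnt library does it. DecodeUTF16LE and LoadUTF16LEFile are hypothetical names, and the sketch assumes Delphi 6, where string is AnsiString.

```pascal
uses
  Classes, SysUtils;

// Hypothetical helper: decodes raw UTF-16 LE bytes into a WideString,
// skipping a leading FF FE BOM if present. An odd trailing byte is ignored.
function DecodeUTF16LE(const Raw: AnsiString): WideString;
var
  Offset, CharCount: Integer;
begin
  Offset := 0;
  if (Length(Raw) >= 2) and (Raw[1] = #$FF) and (Raw[2] = #$FE) then
    Offset := 2;
  CharCount := (Length(Raw) - Offset) div SizeOf(WideChar);
  SetLength(Result, CharCount);
  if CharCount > 0 then
    Move(Raw[Offset + 1], Result[1], CharCount * SizeOf(WideChar));
end;

// Hypothetical helper: reads the whole file as bytes, decodes it and
// hands the text to an ordinary TStrings instance such as TStringList.
procedure LoadUTF16LEFile(const FileName: string; Lines: TStrings);
var
  Stream: TFileStream;
  Size: Integer;
  Raw: AnsiString;
begin
  Stream := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    Size := Stream.Size;
    SetLength(Raw, Size);
    if Size > 0 then
      Stream.ReadBuffer(Raw[1], Size);
  finally
    Stream.Free;
  end;
  // Assigning a WideString to a string converts via the system ANSI code page;
  // characters that the code page cannot represent are lost (typically shown as '?').
  Lines.Text := DecodeUTF16LE(Raw);
end;
```

For the plain letters and numbers described in the question the conversion is lossless, so the existing TStringList code can stay as it is after the load.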

David Heffernan answered Nov 08 '22 21:11
