I wrote an application (a psychological testing exam) in Delphi (7) which creates a standard text file - ie the file is of type ANSI.
Someone has ported the program to run on the Internet, probably using Java, and the resulting text file is of type UTF-8.
The program which reads these results files will have to read both the files created by Delphi and the files created via the Internet.
Whilst I can convert the UTF-8 text to ANSI (using the cunningly named function UTF8ToANSI), how can I tell in advance which kind of file I have?
Seeing as I 'own' the file format, I suppose the easiest way to deal with this would be to place a marker within the file at a known position which will tell me the source of the program (Delphi/Internet), but this seems to be cheating.
Thanks in advance.
Determining the format Look at the "Format:" field; If it says “Personal Folders File” or “Outlook Data File”, it means that you are in UNICODE format. If it says “Personal Folders File (97 - 2002)” or “Outlook Data File (97-2002)”, it means you are in ANSI format.
To verify if a file passes an encoding such as ascii, iso-8859-1, utf-8 or whatever then a good solution is to use the 'iconv' command.
Open up your file using regular old vanilla Notepad that comes with Windows. It will show you the encoding of the file when you click "Save As...". Whatever the default-selected encoding is, that is what your current encoding is for the file.
ANSI and UTF-8 are both encoding formats. ANSI is the common one byte format used to encode Latin alphabet; whereas, UTF-8 is a Unicode format of variable length (from 1 to 4 bytes) which can encode all possible characters.
There is no 100% sure way to recognize ANSI (e.g. Windows-1250) encoding from UTF-8 encoding. There are ANSI files which cannot be valid UTF-8, but every valid UTF-8 file might as well be a different ANSI file. (Not to mention ASCII-only data, which are both ANSI and UTF-8 by definition, but that is purely a theoretical aspect.)
For instance, the sequence C4 8D might be the “č” character in UTF-8, or it might be “ÄŤ” in windows-1250. Both are possible and correct. However, e.g. 8D 9A can be “Ťš” in windows-1250, but it is not a valid UTF-8 string.
You have to resort to some kind of heuristic, e.g.
See also the method used by Notepad.
If we summerize, then:
OTHER INFO PEOPLE MAY FOUND INTERESTING:
https://groups.google.com/forum/?lnk=st&q=delphi+WIN32+functions+to+detect+which+encoding++is+in+use&rnum=1&hl=pt-BR&pli=1#!topic/borland.public.delphi.internationalization.win32/_LgLolX25OA
function FileMayBeUTF8(FileName: WideString): Boolean;
var
Stream: TMemoryStream;
BytesRead: integer;
ArrayBuff: array[0..127] of byte;
PreviousByte: byte;
i: integer;
YesSequences, NoSequences: integer;
begin
if not WideFileExists(FileName) then
Exit;
YesSequences := 0;
NoSequences := 0;
Stream := TMemoryStream.Create;
try
Stream.LoadFromFile(FileName);
repeat
{read from the TMemoryStream}
BytesRead := Stream.Read(ArrayBuff, High(ArrayBuff) + 1);
{Do the work on the bytes in the buffer}
if BytesRead > 1 then
begin
for i := 1 to BytesRead-1 do
begin
PreviousByte := ArrayBuff[i-1];
if ((ArrayBuff[i] and $c0) = $80) then
begin
if ((PreviousByte and $c0) = $c0) then
begin
inc(YesSequences)
end
else
begin
if ((PreviousByte and $80) = $0) then
inc(NoSequences);
end;
end;
end;
end;
until (BytesRead < (High(ArrayBuff) + 1));
//Below, >= makes ASCII files = UTF-8, which is no problem.
//Simple > would catch only UTF-8;
Result := (YesSequences >= NoSequences);
finally
Stream.Free;
end;
end;
Now testing this function...
In my humble opinion only way how to START doing this check correctly is to check OS charset in first place because in the end there almost in all cases are made some references to OS. No way to scape it anyway...
Remarks:
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With