Detecting 'text' file type (ANSI vs UTF-8)

Tags:

utf-8

delphi

delphi-7

I wrote an application (a psychological testing exam) in Delphi (7) which creates a standard text file, i.e. the file is of type ANSI.

Someone has ported the program to run on the Internet, probably using Java, and the resulting text file is of type UTF-8.

The program which reads these results files will have to read both the files created by Delphi and the files created via the Internet.

Whilst I can convert the UTF-8 text to ANSI (using the cunningly named function UTF8ToANSI), how can I tell in advance which kind of file I have?

Seeing as I 'own' the file format, I suppose the easiest way to deal with this would be to place a marker within the file at a known position which will tell me the source of the program (Delphi/Internet), but this seems to be cheating.

Thanks in advance.

asked Feb 05 '11 16:02

No'am Newman

2 Answers

There is no 100% sure way to distinguish an ANSI encoding (e.g. Windows-1250) from UTF-8. There are ANSI files which cannot be valid UTF-8, but every valid UTF-8 file could equally well be a different ANSI file. (Not to mention ASCII-only data, which is both ANSI and UTF-8 by definition, but that is a purely theoretical aspect.)

For instance, the sequence C4 8D might be the “č” character in UTF-8, or it might be “ÄŤ” in Windows-1250. Both are possible and correct. However, the sequence 8D 9A, for example, can be “Ťš” in Windows-1250, but it is not a valid UTF-8 string.

You have to resort to some kind of heuristic, e.g.

1. If the file contains a sequence which cannot be valid UTF-8, assume it is ANSI.
2. Otherwise, if the file begins with the UTF-8 BOM (EF BB BF), assume it is UTF-8 (it might not be, but a plain-text ANSI file beginning with such characters is very improbable).
3. Otherwise, assume it is UTF-8. (Or try more heuristics, maybe using knowledge of the language of the text, etc.)

See also the method used by Notepad.
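The steps above can be sketched roughly as follows. This is only an illustration, assuming Delphi 7 (where string is AnsiString); the names IsValidUtf8, HasUtf8Bom, GuessEncoding, LoadBytes and LoadResultsAsAnsi are made up for this example, and only Utf8ToAnsi is the RTL function mentioned in the question. The validity check is deliberately simple: it rejects stray continuation bytes, illegal lead bytes and truncated sequences, but not every ill-formed case (for example encoded surrogates).

```pascal
program DetectEncodingDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

type
  TGuessedEncoding = (geAnsi, geUtf8);

// True if S looks like well-formed UTF-8 (pure ASCII also passes).
function IsValidUtf8(const S: AnsiString): Boolean;
var
  i, Len, Trail: Integer;
  B: Byte;
begin
  Result := False;
  Len := Length(S);
  i := 1;
  while i <= Len do
  begin
    B := Ord(S[i]);
    if B < $80 then
      Trail := 0                       // ASCII byte
    else if (B >= $C2) and (B <= $DF) then
      Trail := 1                       // lead byte of a 2-byte sequence
    else if (B and $F0) = $E0 then
      Trail := 2                       // lead byte of a 3-byte sequence
    else if (B >= $F0) and (B <= $F4) then
      Trail := 3                       // lead byte of a 4-byte sequence
    else
      Exit;                            // stray continuation or illegal lead byte
    Inc(i);
    while Trail > 0 do
    begin
      if (i > Len) or ((Ord(S[i]) and $C0) <> $80) then
        Exit;                          // truncated sequence or bad continuation
      Inc(i);
      Dec(Trail);
    end;
  end;
  Result := True;
end;

// True if S starts with the UTF-8 BOM (EF BB BF).
function HasUtf8Bom(const S: AnsiString): Boolean;
begin
  Result := (Length(S) >= 3) and (Ord(S[1]) = $EF) and
            (Ord(S[2]) = $BB) and (Ord(S[3]) = $BF);
end;

// Step 1: invalid UTF-8 means ANSI. Steps 2 and 3 both end in "UTF-8",
// so the BOM only matters later, when it has to be stripped.
function GuessEncoding(const S: AnsiString): TGuessedEncoding;
begin
  if IsValidUtf8(S) then
    Result := geUtf8
  else
    Result := geAnsi;
end;

// Reads the whole file into a byte string without any conversion.
function LoadBytes(const FileName: string): AnsiString;
var
  Stream: TFileStream;
begin
  Stream := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    SetLength(Result, Stream.Size);
    if Stream.Size > 0 then
      Stream.ReadBuffer(Result[1], Stream.Size);
  finally
    Stream.Free;
  end;
end;

// Returns the file contents as ANSI text, converting when it looks like UTF-8.
function LoadResultsAsAnsi(const FileName: string): string;
var
  Raw: AnsiString;
begin
  Raw := LoadBytes(FileName);
  if GuessEncoding(Raw) = geUtf8 then
  begin
    if HasUtf8Bom(Raw) then
      Delete(Raw, 1, 3);               // drop the BOM before converting
    Result := Utf8ToAnsi(Raw);
  end
  else
    Result := Raw;
end;

begin
  if ParamCount >= 1 then
    WriteLn(LoadResultsAsAnsi(ParamStr(1)));
end.
```

With the byte sequences from the example above, IsValidUtf8 accepts C4 8D and rejects 8D 9A. A file that contains only ASCII is reported as UTF-8, which is harmless because the conversion leaves ASCII unchanged.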

Mormegil

To summarize:

The best solution for basic usage is the outdated IsTextUnicode() function.
The best solution for advanced usage is to call the function above, then check the BOM ( ~ 1KB ), then check the locale info of the particular OS, and only then do you get about 98% accuracy?

OTHER INFO PEOPLE MAY FIND INTERESTING:

https://groups.google.com/forum/?lnk=st&q=delphi+WIN32+functions+to+detect+which+encoding++is+in+use&rnum=1&hl=pt-BR&pli=1#!topic/borland.public.delphi.internationalization.win32/_LgLolX25OA

function FileMayBeUTF8(FileName: WideString): Boolean;
var
  Stream: TMemoryStream;
  BytesRead: Integer;
  ArrayBuff: array[0..127] of Byte;
  PreviousByte: Byte;
  i: Integer;
  YesSequences, NoSequences: Integer;
begin
  // Give the function a defined result when the file does not exist.
  Result := False;
  if not WideFileExists(FileName) then
    Exit;
  YesSequences := 0;
  NoSequences := 0;
  // Carried across buffer reads so pairs split between two chunks still count.
  PreviousByte := 0;
  Stream := TMemoryStream.Create;
  try
    Stream.LoadFromFile(FileName);
    repeat
      {read from the TMemoryStream}
      BytesRead := Stream.Read(ArrayBuff, SizeOf(ArrayBuff));
      {Do the work on the bytes in the buffer}
      for i := 0 to BytesRead - 1 do
      begin
        // Is this a UTF-8 continuation byte (10xxxxxx)?
        if (ArrayBuff[i] and $C0) = $80 then
        begin
          if (PreviousByte and $C0) = $C0 then
            Inc(YesSequences)   // preceded by a lead byte (11xxxxxx): plausible UTF-8
          else if (PreviousByte and $80) = $0 then
            Inc(NoSequences);   // preceded by an ASCII byte: not valid UTF-8
        end;
        PreviousByte := ArrayBuff[i];
      end;
    until BytesRead < SizeOf(ArrayBuff);
    //Below, >= makes ASCII files = UTF-8, which is no problem.
    //Simple > would catch only UTF-8;
    Result := (YesSequences >= NoSequences);
  finally
    Stream.Free;
  end;
end;

Now testing this function...

In my humble opinion, the only way to START doing this check correctly is to check the OS charset in the first place, because in the end almost all cases involve some reference to the OS. There is no way to escape it anyway...

Remarks:

WideFileExists() function is taken from TntClasses.pas ( Koders.net source ).

answered Sep 25 '22 20:09

HX_unbanned
