My program has to read files that use various encodings. They may be ANSI, UTF-8 or UTF-16 (big or little endian).
When the BOM (Byte Order Mark) is there, I have no problem: I know whether the file is UTF-8, UTF-16 BE, or UTF-16 LE.
I wanted to assume that a file with no BOM was ANSI, but I have found that the files I am dealing with are often missing their BOM. Therefore, no BOM may mean the file is ANSI, UTF-8, UTF-16 BE, or UTF-16 LE.
When the file has no BOM, what would be the best way to scan some of the file and most accurately guess the type of encoding? I'd like to be right close to 100% of the time if the file is ANSI, and in the high 90s if it is a UTF format.
I'm looking for a generic algorithmic way to determine this. But I actually use Delphi 2009, which knows Unicode and has a TEncoding class, so something specific to that would be a bonus.
Answer:
ShreevatsaR's answer led me to search Google for "universal encoding detector delphi", which surprised me by listing this post in the #1 position after it had been alive for only about 45 minutes! That is fast googlebotting!! And also amazing that Stack Overflow gets into 1st place so quickly.
The 2nd entry in Google was a blog entry by Fred Eaker on Character encoding detection that listed algorithms in various languages.
I found the mention of Delphi on that page, and it led me straight to the free, open-source ChsDet charset detector on SourceForge, written in Delphi and based on Mozilla's i18n component.
Fantastic! Thank you to all those who answered (all +1), thank you ShreevatsaR, and thank you again Stack Overflow, for helping me find my answer in less than an hour!
To check whether a BOM is present, open the file in Notepad++ and look at the status bar in the bottom right corner. If it says UTF-8-BOM, then the file starts with a BOM.
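For a programmatic check, a minimal Python sketch along these lines compares the first bytes of the file against the standard BOM signatures (detect_bom is just an illustrative helper name; the byte sequences themselves are the well-known BOMs):

```python
# BOM signatures for the encodings mentioned in the question.
BOMS = [
    (b'\xef\xbb\xbf', 'utf-8-sig'),   # EF BB BF
    (b'\xff\xfe', 'utf-16-le'),       # FF FE
    (b'\xfe\xff', 'utf-16-be'),       # FE FF
]

def detect_bom(path):
    """Return the encoding indicated by a BOM, or None if the file has no BOM."""
    with open(path, 'rb') as f:
        head = f.read(3)
    for bom, encoding in BOMS:
        if head.startswith(bom):
            return encoding
    return None
```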
UTF-8 has the same byte order regardless of platform endianness, so a byte order mark isn't needed. However, it may occur (as the byte sequence EF BB BF) in data that was converted to UTF-8 from UTF-16, or as a "signature" to indicate that the data is UTF-8.
The UTF-8 encoding without a BOM has the property that a document which contains only characters from the US-ASCII range is encoded byte-for-byte the same way as the same document encoded using the US-ASCII encoding. Such a document can be processed and understood when encoded either as UTF-8 or as US-ASCII.
Since UTF-8 is interpreted as a sequence of bytes, there is no endian problem as there is for encoding forms that use 16-bit or 32-bit code units.
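Those two properties suggest a simple heuristic for BOM-less files. This is only a rough Python sketch, not the Mozilla-based detector mentioned above; guess_encoding_without_bom is a made-up helper and the 30% null-byte threshold is an arbitrary assumption. A strict UTF-8 decode succeeds for any file that is pure ASCII or valid UTF-8, while a large share of zero bytes usually points to UTF-16, whose endianness can be guessed from where the zeros fall:

```python
def guess_encoding_without_bom(data: bytes) -> str:
    """Rough heuristic for BOM-less data: UTF-16 (via null bytes), UTF-8, else ANSI."""
    # UTF-16 text that is mostly Latin script has a zero byte in every other position.
    if data and data.count(0) / len(data) > 0.3:
        # Zeros in even positions suggest big-endian; zeros in odd positions, little-endian.
        even_nulls = sum(1 for i in range(0, len(data), 2) if data[i] == 0)
        odd_nulls = sum(1 for i in range(1, len(data), 2) if data[i] == 0)
        return 'utf-16-be' if even_nulls > odd_nulls else 'utf-16-le'
    try:
        data.decode('utf-8')   # strict decode: any invalid sequence raises
        return 'utf-8'         # also covers pure ASCII, which is a subset of UTF-8
    except UnicodeDecodeError:
        return 'ansi'          # e.g. windows-1252 or another legacy codepage
```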
Maybe you can shell out to a Python script that uses Chardet: Universal Encoding Detector. It is a reimplementation of the character encoding detection used by Firefox, and it is used by many different applications. Useful links: Mozilla's code, the research paper it was based on (ironically, my Firefox fails to correctly detect the encoding of that page), a short explanation, a detailed explanation.
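For reference, calling Chardet from Python itself is only a couple of lines (a minimal sketch; the filename is illustrative, but chardet.detect and its result keys are the library's documented interface):

```python
import chardet  # pip install chardet

with open('mystery.txt', 'rb') as f:
    raw = f.read()

result = chardet.detect(raw)   # e.g. {'encoding': 'utf-8', 'confidence': 0.99, ...}
print(result['encoding'], result['confidence'])
```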