I'm looking for a way to detect character sets within documents. I've been reading the Mozilla character set detection implementation here:
Universal Charset Detection
I've also found a Java implementation of this called jCharDet:
JCharDet
Both of these are based on research carried out using a set of static data. What I'm wondering is whether anybody has used any other implementation successfully and if so what? Did you roll your own approach and if so what was the algorithm you used to detect the character set?
Any help would be appreciated. I'm not looking for a list of existing approaches via Google, nor am I looking for a link to the Joel Spolsky article - just to clarify : )
UPDATE: I did a bunch of research into this and ended up finding a framework called cpdetector that uses a pluggable approach to character detection, see:
CPDetector
This provides BOM, chardet (Mozilla approach) and ASCII detection plugins. It's also very easy to write your own. There's also another framework that provides much better character detection than the Mozilla approach/jCharDet:
ICU4J
It's quite easy to write your own plugin for cpdetector that uses this framework to provide a more accurate character encoding detection algorithm. It works better than the Mozilla approach.
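For reference, the detection call with ICU4J looks roughly like this (a minimal sketch built around ICU4J's CharsetDetector/CharsetMatch classes; the cpdetector plugin wrapper around it is omitted here):

```java
import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class Icu4jDetectionExample {
    public static void main(String[] args) throws IOException {
        // Read the raw bytes of the document whose encoding is unknown.
        byte[] data = Files.readAllBytes(Paths.get(args[0]));

        CharsetDetector detector = new CharsetDetector();
        detector.setText(data);

        // detect() returns the best match; detectAll() would return every
        // candidate ranked by confidence if you want to inspect the alternatives.
        CharsetMatch match = detector.detect();
        if (match != null) {
            System.out.println("Charset:    " + match.getName());
            System.out.println("Language:   " + match.getLanguage());
            System.out.println("Confidence: " + match.getConfidence() + "/100");
        } else {
            System.out.println("No encoding could be detected.");
        }
    }
}
```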
One way to check a document's encoding is to use the W3C Markup Validation Service. The validator usually detects the character encoding from the HTTP headers and information in the document. If the validator fails to detect the encoding, it can be selected on the validator result page via the 'Encoding' pulldown menu.
A character encoding scheme (CES) is the mapping of code units to a sequence of octets to facilitate storage on an octet-based file system or transmission over an octet-based network.
The difference between Unicode and UTF-8: Unicode is a character set; UTF-8 is an encoding. Unicode is a list of characters with unique decimal numbers (code points): A = 65, B = 66, C = 67, ...
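To make that concrete, here is a tiny Java illustration (my own, not tied to any of the libraries above): the Unicode code point of a character never changes, but the octets written out depend entirely on which encoding you pick:

```java
import java.nio.charset.StandardCharsets;

public class CodePointVsBytes {
    public static void main(String[] args) {
        String s = "é";  // Unicode code point U+00E9

        // The code point is a property of the character set (Unicode)...
        System.out.printf("Code point: U+%04X%n", s.codePointAt(0));

        // ...while the byte sequence depends on the encoding chosen.
        printBytes("UTF-8     ", s.getBytes(StandardCharsets.UTF_8));      // C3 A9
        printBytes("ISO-8859-1", s.getBytes(StandardCharsets.ISO_8859_1)); // E9
        printBytes("UTF-16BE  ", s.getBytes(StandardCharsets.UTF_16BE));   // 00 E9
    }

    private static void printBytes(String label, byte[] bytes) {
        StringBuilder sb = new StringBuilder(label + ": ");
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b));
        }
        System.out.println(sb.toString().trim());
    }
}
```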
Years ago we had character set detection for a mail application, and we rolled our own. The mail app was actually a WAP application, and the phone expected UTF-8. There were several steps:
Universal detection
We could easily detect whether text was UTF-8, as there is a specific bit pattern in the top bits of the second, third, etc. bytes of each multi-byte sequence. Once you found that pattern repeated a certain number of times you could be certain it was UTF-8.
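That check can be sketched roughly like this (my own reconstruction of the idea, not the original code): accept only bytes that follow UTF-8's lead-byte/continuation-byte bit patterns, and require a minimum number of multi-byte sequences before declaring UTF-8:

```java
/**
 * Rough UTF-8 sniffer: multi-byte UTF-8 sequences have a lead byte of the form
 * 110xxxxx, 1110xxxx or 11110xxx, followed by continuation bytes 10xxxxxx.
 * Once enough well-formed sequences are seen, UTF-8 is a safe bet.
 */
public class Utf8Sniffer {

    public static boolean looksLikeUtf8(byte[] data, int requiredSequences) {
        int found = 0;
        int i = 0;
        while (i < data.length) {
            int b = data[i] & 0xFF;
            if (b < 0x80) {                                   // 0xxxxxxx: plain ASCII, no evidence either way
                i++;
                continue;
            }
            int continuations;
            if ((b & 0xE0) == 0xC0) continuations = 1;        // 110xxxxx
            else if ((b & 0xF0) == 0xE0) continuations = 2;   // 1110xxxx
            else if ((b & 0xF8) == 0xF0) continuations = 3;   // 11110xxx
            else return false;                                // stray continuation or invalid lead byte
            if (i + continuations >= data.length) break;      // truncated at end of buffer
            for (int j = 1; j <= continuations; j++) {
                if ((data[i + j] & 0xC0) != 0x80) {           // continuation must be 10xxxxxx
                    return false;
                }
            }
            found++;
            i += continuations + 1;
        }
        return found >= requiredSequences;
    }
}
```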
If the file begins with a UTF-16 byte order mark, you can probably assume the rest of the text is that encoding. Otherwise, detecting UTF-16 isn't nearly as easy as UTF-8, unless you can detect the surrogate pairs pattern: but the use of surrogate pairs is rare, so that doesn't usually work. UTF-32 is similar, except there are no surrogate pairs to detect.
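The BOM sniffing itself is mechanical; a minimal sketch (class and method names are my own) might look like this:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BomSniffer {

    /** Returns the charset implied by a byte order mark, or null if no BOM is present. */
    public static Charset fromBom(byte[] b) {
        // UTF-8 BOM: EF BB BF
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        // UTF-32 BOMs must be checked before UTF-16LE, because FF FE 00 00 starts with FF FE.
        if (b.length >= 4 && b[0] == 0x00 && b[1] == 0x00
                && (b[2] & 0xFF) == 0xFE && (b[3] & 0xFF) == 0xFF) {
            return Charset.forName("UTF-32BE");
        }
        if (b.length >= 4 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE
                && b[2] == 0x00 && b[3] == 0x00) {
            return Charset.forName("UTF-32LE");
        }
        // UTF-16 BOMs: FE FF (big-endian) or FF FE (little-endian)
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        return null; // no BOM: fall back to other heuristics
    }
}
```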
Regional detection
Next we would assume the reader was in a certain region. For instance, if the user was seeing the UI localized in Japanese, we could then attempt detection of the three main Japanese encodings. ISO-2022-JP is again easy to detect with the escape sequences. If that fails, determining the difference between EUC-JP and Shift-JIS is not as straightforward. It's more likely that a user would receive Shift-JIS text, but there were characters in EUC-JP that didn't exist in Shift-JIS, and vice-versa, so sometimes you could get a good match.
The same procedure was used for Chinese encodings and other regions.
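A much-simplified sketch of the regional step (my own approximation, not the original code) is to try a region-specific list of candidate encodings and keep the first one that decodes without any illegal byte sequences. Note this only catches the cases where a candidate actually rejects the data, which is weaker than the character-coverage comparison described above:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.util.List;
import java.util.Optional;

public class RegionalGuesser {

    // Candidate list for a Japanese locale; other regions would supply their own ordering.
    static final List<Charset> JAPANESE_CANDIDATES = List.of(
            Charset.forName("ISO-2022-JP"),
            Charset.forName("EUC-JP"),
            Charset.forName("Shift_JIS"));

    /** Returns the first candidate that decodes the data without any malformed input. */
    public static Optional<Charset> guess(byte[] data, List<Charset> candidates) {
        for (Charset candidate : candidates) {
            try {
                candidate.newDecoder()
                        .onMalformedInput(CodingErrorAction.REPORT)
                        .onUnmappableCharacter(CodingErrorAction.REPORT)
                        .decode(ByteBuffer.wrap(data));
                return Optional.of(candidate);   // decoded cleanly
            } catch (CharacterCodingException e) {
                // Contains byte sequences that are illegal in this encoding; try the next one.
            }
        }
        return Optional.empty();                 // nothing decoded cleanly: ask the user
    }
}
```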
User's choice
If these steps didn't produce a satisfactory result, the user had to choose an encoding manually.