Given an array of bytes representing text in some unknown encoding (usually UTF-8 or ISO-8859-1, but not necessarily so), what is the best way to obtain a guess for the most likely encoding used (in Java)?
Worth noting:
The eight primitive data types supported by the Java programming language are: byte: The byte data type is an 8-bit signed two's complement integer. It has a minimum value of -128 and a maximum value of 127 (inclusive).
We can use String class getBytes() method to encode the string into a sequence of bytes using the platform's default charset. This method is overloaded and we can also pass Charset as argument. Here is a simple program showing how to convert String to byte array in java.
The following method solves the problem using juniversalchardet, which is a Java port of Mozilla's encoding detection library.
public static String guessEncoding(byte[] bytes) { String DEFAULT_ENCODING = "UTF-8"; org.mozilla.universalchardet.UniversalDetector detector = new org.mozilla.universalchardet.UniversalDetector(null); detector.handleData(bytes, 0, bytes.length); detector.dataEnd(); String encoding = detector.getDetectedCharset(); detector.reset(); if (encoding == null) { encoding = DEFAULT_ENCODING; } return encoding; }
The code above has been tested and works as intented. Simply add juniversalchardet-1.0.3.jar to the classpath.
I've tested both juniversalchardet and jchardet. My general impression is that juniversalchardet provides the better detection accuracy and the nicer API of the two libraries.
There is also Apache Tika - a content analysis toolkit. It can guess the mime type, and it can guess the encoding. Usually the guess is correct with a very high probability.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With