
Guessing the encoding of text represented as byte[] in Java

Given an array of bytes representing text in some unknown encoding (usually UTF-8 or ISO-8859-1, but not necessarily so), what is the best way to obtain a guess for the most likely encoding used (in Java)?

Worth noting:

  • No additional meta-data is available. The byte array is literally the only available input.
  • The detection algorithm will obviously not be 100% correct. If it is correct in more than, say, 80% of cases, that is good enough.
knorv asked Nov 04 '09 23:11

2 Answers

The following method solves the problem using juniversalchardet, which is a Java port of Mozilla's encoding detection library.

    public static String guessEncoding(byte[] bytes) {
        String DEFAULT_ENCODING = "UTF-8";
        org.mozilla.universalchardet.UniversalDetector detector =
            new org.mozilla.universalchardet.UniversalDetector(null);
        // Feed the whole array to the detector, then signal end of input.
        detector.handleData(bytes, 0, bytes.length);
        detector.dataEnd();
        String encoding = detector.getDetectedCharset();
        detector.reset();
        // No charset detected: fall back to the default.
        if (encoding == null) {
            encoding = DEFAULT_ENCODING;
        }
        return encoding;
    }

The code above has been tested and works as intended. Simply add juniversalchardet-1.0.3.jar to the classpath.

I've tested both juniversalchardet and jchardet. My general impression is that juniversalchardet provides the better detection accuracy and the nicer API of the two libraries.
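For the narrow case the question mentions (input that is usually UTF-8 or ISO-8859-1), a dependency-free first check is also possible. The following is only a minimal sketch using java.nio.charset from the standard library; it is not part of juniversalchardet or jchardet, and the class and method names (SimpleEncodingGuess, guessUtf8OrLatin1) are made up for illustration. It assumes that bytes which decode as strict UTF-8 are UTF-8, and that anything else is ISO-8859-1.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class SimpleEncodingGuess {

    // Hypothetical helper: returns "UTF-8" if the bytes form valid UTF-8,
    // otherwise "ISO-8859-1" (which can decode any byte sequence).
    static String guessUtf8OrLatin1(byte[] bytes) {
        // REPORT makes the decoder throw instead of silently
        // replacing malformed sequences with U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return "UTF-8";
        } catch (CharacterCodingException e) {
            return "ISO-8859-1";
        }
    }

    public static void main(String[] args) {
        // Unicode escapes keep the example independent of the source file encoding.
        String sample = "Gr\u00fc\u00dfe";
        System.out.println(guessUtf8OrLatin1(sample.getBytes(StandardCharsets.UTF_8)));
        System.out.println(guessUtf8OrLatin1(sample.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

Note the limits of this approach: pure ASCII input is reported as UTF-8 (it is valid in both encodings), and any other single-byte encoding such as windows-1252 would be reported as ISO-8859-1. For anything beyond the two-encoding case, a statistical detector like the ones discussed here is the better fit.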

knorv answered Sep 21 '22 06:09


There is also Apache Tika, a content analysis toolkit. It can guess the MIME type, and it can guess the encoding. The guess is usually correct with very high probability.

Thomas Mueller answered Sep 19 '22 06:09
