Guessing the encoding of text represented as byte[] in Java

Given an array of bytes representing text in some unknown encoding (usually UTF-8 or ISO-8859-1, but not necessarily so), what is the best way to obtain a guess for the most likely encoding used (in Java)?

Worth noting:

No additional meta-data is available. The byte array is literally the only available input.
The detection algorithm will obviously not be 100% correct. If it is correct in more than, say, 80% of cases, that is good enough.

833

asked Nov 04 '09 23:11

knorv

2 Answers

The following method solves the problem using juniversalchardet, which is a Java port of Mozilla's encoding detection library.

public static String guessEncoding(byte[] bytes) {
    String DEFAULT_ENCODING = "UTF-8";
    org.mozilla.universalchardet.UniversalDetector detector =
        new org.mozilla.universalchardet.UniversalDetector(null);
    detector.handleData(bytes, 0, bytes.length);
    detector.dataEnd();
    String encoding = detector.getDetectedCharset();
    detector.reset();
    if (encoding == null) {
        encoding = DEFAULT_ENCODING;
    }
    return encoding;
}

The code above has been tested and works as intended. Simply add juniversalchardet-1.0.3.jar to the classpath.
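guessEncoding returns only a charset name, so one more step is needed to get text. The following is a minimal sketch of that step using only the JDK. The class DetectedTextDecoder and its decode method are hypothetical names, not part of juniversalchardet, and the sketch assumes that falling back to UTF-8 is acceptable when the JVM does not recognise the detected name.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;

final class DetectedTextDecoder {

    // Hypothetical helper: decodes bytes with a detected charset name and
    // falls back to UTF-8 (an assumption) when the JVM cannot use that name.
    static String decode(byte[] bytes, String encodingName) {
        Charset charset = StandardCharsets.UTF_8;
        try {
            if (encodingName != null && Charset.isSupported(encodingName)) {
                charset = Charset.forName(encodingName);
            }
        } catch (IllegalCharsetNameException e) {
            // The name is not syntactically valid; keep the UTF-8 fallback.
        }
        // Malformed input is replaced by the decoder rather than rejected.
        return new String(bytes, charset);
    }
}
```

A call would then look like DetectedTextDecoder.decode(bytes, guessEncoding(bytes)).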

I've tested both juniversalchardet and jchardet. My general impression is that juniversalchardet provides the better detection accuracy and the nicer API of the two libraries.
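If adding a library is not an option and the input really is almost always UTF-8 or ISO-8859-1, as the question suggests, a much cruder check can be sketched with the JDK alone: try a strict UTF-8 decode and treat failure as ISO-8859-1. This is not what juniversalchardet or jchardet do internally; the class and method names below are hypothetical, and the ISO-8859-1 fallback is an assumption taken from the question.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class SimpleEncodingGuess {

    // Returns true if the bytes are well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Guesses "UTF-8" when the bytes decode strictly, otherwise "ISO-8859-1"
    // (assumed fallback; every byte sequence is valid ISO-8859-1).
    static String guess(byte[] bytes) {
        return isValidUtf8(bytes) ? "UTF-8" : "ISO-8859-1";
    }
}
```

Pure ASCII input is reported as UTF-8, which is harmless because ASCII decodes identically in both encodings. The check cannot tell ISO-8859-1 apart from other single-byte encodings such as windows-1252, so a detection library remains the better choice when the set of possible encodings is open.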

170

answered Sep 21 '22 06:09

knorv

There is also Apache Tika, a content analysis toolkit. It can guess the MIME type as well as the encoding, and the guess is usually correct.

answered Sep 19 '22 06:09

Thomas Mueller

Tags:

java

character-encoding

encoding

utf-8
