After some research, I have discovered that there are a few encoding detection projects in the Java world, for cases where getEncoding() in InputStreamReader does not work.
However, I really do not know which is the best of them all. Can anyone with hands-on experience tell me which one works best in Java?
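For reference, here is roughly what I mean by the baseline approach (a minimal sketch; the file name is just an example):

```java
import java.io.FileInputStream;
import java.io.InputStreamReader;

public class GetEncodingDemo {
    public static void main(String[] args) throws Exception {
        // getEncoding() only reports the charset the reader was opened with
        // (the platform default here), it does not detect the file's
        // actual encoding from its content.
        try (InputStreamReader reader =
                new InputStreamReader(new FileInputStream("input.csv"))) {
            System.out.println(reader.getEncoding()); // e.g. "UTF8"
        }
    }
}
```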
I've checked juniversalchardet and ICU4J on some CSV files, and the results were inconsistent: juniversalchardet gave better results overall. So you should consider which encodings you will most likely have to deal with. In the end I chose ICU4J.
Notice that ICU4J is still maintained.
Also note that you may want to try ICU4J first and, if it returns null because detection failed, fall back to juniversalchardet, or the other way around.
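A minimal sketch of that fallback, assuming both jars are on the classpath (the class and method names here are my own illustration, not a library API):

```java
import org.mozilla.universalchardet.UniversalDetector;

import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

public class EncodingSniffer {

    // Try ICU4J first; if it finds nothing, fall back to juniversalchardet.
    static String detectCharset(byte[] data) {
        CharsetDetector icu = new CharsetDetector();
        icu.setText(data);
        CharsetMatch match = icu.detect();
        if (match != null) {
            return match.getName(); // e.g. "UTF-8", "windows-1252"
        }
        UniversalDetector mozilla = new UniversalDetector(null);
        mozilla.handleData(data, 0, data.length);
        mozilla.dataEnd();
        return mozilla.getDetectedCharset(); // may still be null
    }
}
```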
Apache Tika's AutoDetectReader does exactly this: it first tries HtmlEncodingDetector, then UniversalEncodingDetector (which is based on juniversalchardet), and then Icu4jEncodingDetector (based on ICU4J).
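If you'd rather let Tika do the chaining for you, a rough sketch (the file name is just an example):

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.nio.charset.Charset;

import org.apache.tika.detect.AutoDetectReader;

public class TikaDetectDemo {
    public static void main(String[] args) throws Exception {
        // AutoDetectReader runs Tika's detector chain and exposes the winner.
        try (InputStream in = new FileInputStream("input.csv");
             AutoDetectReader reader = new AutoDetectReader(in)) {
            Charset charset = reader.getCharset();
            System.out.println("Detected: " + charset);
            // The reader itself is already decoding with the detected charset.
        }
    }
}
```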
I found an answer online:
http://fredeaker.blogspot.com/2007/01/character-encoding-detection.html
It says something valuable:
The strength of a character encoding detector lies in whether or not its focus is on statistical analysis or HTML META and XML prolog discovery. If you are processing HTML files that have META, use cpdetector. Otherwise, your best bet is either monq.stuff.EncodingDetector or com.sun.syndication.io.XmlReader.
That's why I am using cpdetector now. I will update this post with the results.
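For reference, a typical cpdetector setup looks roughly like this (a sketch; note that the package names have moved between releases, with older builds using cpdetector.io and newer ones info.monitorenter.cpdetector.io):

```java
import java.io.File;
import java.nio.charset.Charset;

import cpdetector.io.ASCIIDetector;
import cpdetector.io.CodepageDetectorProxy;
import cpdetector.io.JChardetFacade;
import cpdetector.io.ParsingDetector;

public class CpdetectorDemo {
    public static void main(String[] args) throws Exception {
        CodepageDetectorProxy detector = CodepageDetectorProxy.getInstance();
        // Detectors are tried in the order they are added.
        detector.add(new ParsingDetector(false));   // HTML META / XML prolog
        detector.add(JChardetFacade.getInstance()); // statistical analysis
        detector.add(ASCIIDetector.getInstance());  // plain ASCII fallback
        Charset charset = detector.detectCodepage(
                new File("input.html").toURI().toURL());
        System.out.println("Detected: " + charset);
    }
}
```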