How to check the charset of string in Java?

In my application I'm getting the user info from LDAP and sometimes the full username comes in a wrong charset. For example:

ТеÑÑ61 ТеÑÑовиÑ61 

The name can also arrive in English or in Russian and be displayed correctly. If the username changes, it is updated in the database. Even if I change the value in the DB, it won't solve the problem.

I can fix it before saving by doing this:

new String(incorrect.getBytes("ISO-8859-1"), "UTF-8"); 

However, if I apply it to a string that already contains correctly decoded Russian characters (e.g., "Тест61 Тестович61"), I get something like "????61 ????????61".

Can you please suggest something that can determine the charset of a string?

Adilya Taimussova asked Jul 16 '12 03:07



1 Answer

Strings in Java, AFAIK, do not retain their original encoding: they are always stored internally in some Unicode form (the API exposes them as UTF-16 code units). What you want is to detect the charset of the original stream/bytes, which is why I think your String.getBytes() call comes too late.
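If only the String is available, the best you can do is guess from its characters. Below is a minimal sketch, assuming the garbled names are UTF-8 bytes that were decoded as ISO-8859-1 (which is what the workaround in the question implies). The class and method names (MojibakeRepair, repairIfMisdecoded) are made up for illustration. The idea: a string containing any character above U+00FF cannot be the product of an ISO-8859-1 decode, so it is left alone; otherwise the bytes are recovered and re-decoded as UTF-8 only if they form strictly valid UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class MojibakeRepair {

    // Hypothetical helper: undo a "UTF-8 read as ISO-8859-1" mistake, but only
    // when the string actually shows the signature of that mistake.
    static String repairIfMisdecoded(String s) {
        boolean sawHighChar = false;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c > 0xFF) {
                // Not representable in ISO-8859-1 (e.g. real Cyrillic), so the
                // string cannot come from an ISO-8859-1 decode. Leave it alone.
                return s;
            }
            if (c > 0x7F) {
                sawHighChar = true;
            }
        }
        if (!sawHighChar) {
            return s; // pure ASCII reads the same in both charsets
        }

        // Lossless here: every char is <= 0xFF, so this recovers the raw bytes.
        byte[] raw = s.getBytes(StandardCharsets.ISO_8859_1);
        try {
            // Strict decode: throw on invalid UTF-8 instead of substituting U+FFFD.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            return s; // genuine Latin-1 text such as "café"; keep as is
        }
    }

    public static void main(String[] args) {
        // "Тест61" written with escapes so the source file encoding does not matter.
        String correct = "\u0422\u0435\u0441\u044261";
        String garbled = new String(correct.getBytes(StandardCharsets.UTF_8),
                                    StandardCharsets.ISO_8859_1);

        // The unconditional round trip from the question destroys a correct string:
        // Cyrillic has no ISO-8859-1 encoding, so each letter becomes '?'.
        System.out.println(new String(correct.getBytes(StandardCharsets.ISO_8859_1),
                                      StandardCharsets.UTF_8)); // prints ????61

        System.out.println(repairIfMisdecoded(garbled).equals(correct)); // prints true
        System.out.println(repairIfMisdecoded(correct).equals(correct)); // prints true
    }
}
```

This is still a heuristic: a genuine Latin-1 string whose bytes happen to form valid UTF-8 would be converted by mistake, so fixing the decoding at the source remains the safer option.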

Ideally, if you can get hold of the input stream you are reading from, you can run it through something like juniversalchardet: http://code.google.com/p/juniversalchardet/

There are plenty of other charset detectors out there as well.
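If pulling in a detector library is not an option and you can get at the raw bytes, a cruder stdlib-only approach is to try a strict UTF-8 decode first and fall back to a charset you choose when that fails. The sketch below assumes exactly that policy. BytesDecoder and decodeUtf8OrFallback are hypothetical names, and the fallback charset is a guess you have to supply (for Russian data it would likely be a legacy Cyrillic charset; ISO-8859-1 is used in the demo only because every JVM ships it).

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class BytesDecoder {

    // Hypothetical helper: decode as UTF-8 when the bytes are valid UTF-8,
    // otherwise decode with the caller-supplied fallback charset.
    static String decodeUtf8OrFallback(byte[] raw, Charset fallback) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)      // fail instead of
                    .onUnmappableCharacter(CodingErrorAction.REPORT) // inserting U+FFFD
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8: assume the legacy single-byte encoding.
            return new String(raw, fallback);
        }
    }

    public static void main(String[] args) {
        byte[] utf8Bytes = "\u0422\u0435\u0441\u044261".getBytes(StandardCharsets.UTF_8);
        byte[] latin1Bytes = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(decodeUtf8OrFallback(utf8Bytes, StandardCharsets.ISO_8859_1));
        System.out.println(decodeUtf8OrFallback(latin1Bytes, StandardCharsets.ISO_8859_1));
    }
}
```

This works reasonably well because non-ASCII text in a single-byte legacy encoding rarely happens to be valid UTF-8, but it only distinguishes "UTF-8" from "something else"; a real detector is needed to tell legacy charsets apart.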

radai answered Oct 19 '22 02:10