
Check if a String is valid UTF-8 encoded in Java

How can I check if a string is in valid UTF-8 format?

Michael Bavin asked Jul 08 '11 09:07



1 Answer

Only byte data can be checked for valid UTF-8. Once you have constructed a String, it is already stored as UTF-16 internally.

Likewise, only byte arrays can be UTF-8 encoded.
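To illustrate why the check has to happen on the bytes, here is a minimal sketch, assuming the data arrives as a byte[] and is decoded with the standard String constructor. The class and method names (LossyDecodeDemo, lenientDecode, survivesRoundTrip) are made up for this example. Decoding this way replaces malformed input with U+FFFD, so the resulting String no longer tells you what the original bytes were.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of any library.
class LossyDecodeDemo {

    // Lenient decode: new String(byte[], Charset) substitutes U+FFFD
    // for malformed sequences instead of throwing.
    static String lenientDecode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Decode and re-encode, then compare with the input bytes.
    static boolean survivesRoundTrip(byte[] bytes) {
        byte[] again = lenientDecode(bytes).getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(bytes, again);
    }

    public static void main(String[] args) {
        // 0xC3 starts a two-byte sequence, but 0x28 is not a continuation byte.
        byte[] invalid = {(byte) 0xC3, (byte) 0x28};

        String decoded = lenientDecode(invalid);
        System.out.println(decoded.indexOf('\uFFFD') >= 0); // true: the bad byte was replaced
        System.out.println(survivesRoundTrip(invalid));     // false: the original bytes are gone
    }
}
```

The String that comes out is a perfectly legal Java String, which is why inspecting it afterwards cannot tell you whether the input was well formed.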

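Here is a common case of UTF-8 conversion, encoding a String into bytes:

    import java.io.UnsupportedEncodingException;

    public class Utf8Example {
        public static void main(String[] args) {
            // "\u0048\u0065\u006C\u006C\u006F" is "Hello" written as Unicode escapes.
            String myString = "\u0048\u0065\u006C\u006C\u006F World";
            System.out.println(myString);
            byte[] myBytes = null;

            try {
                // Encode the UTF-16 String into a UTF-8 byte array.
                myBytes = myString.getBytes("UTF-8");
            } catch (UnsupportedEncodingException e) {
                e.printStackTrace();
                System.exit(-1);
            }

            // Print each encoded byte as a signed decimal value.
            for (int i = 0; i < myBytes.length; i++) {
                System.out.println(myBytes[i]);
            }
        }
    }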
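Going the other way, from bytes you did not produce yourself, one possible way to test for valid UTF-8 is a strict decode with the JDK's CharsetDecoder. This is a sketch rather than the method of the answer above: the class name Utf8Check and the helper isValidUtf8 are made up for illustration, while the java.nio.charset calls are standard JDK API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any library.
class Utf8Check {

    // Returns true if the whole array decodes as UTF-8 without errors.
    static boolean isValidUtf8(byte[] bytes) {
        // REPORT makes the decoder throw instead of substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // Malformed, truncated, or overlong sequence somewhere in the input.
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] good = "Hello World".getBytes(StandardCharsets.UTF_8);
        byte[] bad = {(byte) 0xC3, (byte) 0x28}; // lead byte without a continuation byte

        System.out.println(isValidUtf8(good)); // true
        System.out.println(isValidUtf8(bad));  // false
    }
}
```

A new decoder is created per call because CharsetDecoder instances are stateful and not safe to share between threads. Note that passing this check only means the bytes are well-formed UTF-8. It does not prove they were meant as UTF-8, since any pure ASCII input passes too.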

If you don't know the encoding of your byte array, juniversalchardet is a library to help you detect it.

DArkO answered Sep 23 '22 15:09