How can I check if a string is in valid UTF-8 format?
Do you mean checking whether a byte[] is validly encoded? The simplest thing to do might be to decode it and encode it again, then check that you get the same bytes back.
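The decode-and-re-encode check described above can be sketched roughly as follows. This is only an illustration: the class name `Utf8RoundTrip` and the method `isValidUtf8` are made up for this example. It relies on the fact that `new String(bytes, UTF_8)` replaces malformed input with U+FFFD rather than throwing, so invalid input does not survive the round trip unchanged.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of any library.
final class Utf8RoundTrip {

    // Decode the bytes as UTF-8, encode the result again, and compare.
    // Malformed sequences are replaced with U+FFFD while decoding, so the
    // re-encoded bytes differ from the input whenever the input was invalid.
    static boolean isValidUtf8(byte[] bytes) {
        String decoded = new String(bytes, StandardCharsets.UTF_8);
        byte[] reencoded = decoded.getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(bytes, reencoded);
    }

    public static void main(String[] args) {
        byte[] good = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        byte[] bad = {(byte) 0xC3, (byte) 0x28}; // lead byte followed by a non-continuation byte
        System.out.println(isValidUtf8(good));
        System.out.println(isValidUtf8(bad));
    }
}
```

This allocates a String and a second byte array, so it is convenient rather than fast; for large inputs a strict decoder that stops at the first error is probably a better fit.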
A Java String is internally always encoded in UTF-16 - but you really should think about it like this: an encoding is a way to translate between Strings and bytes.
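To make the "translation between Strings and bytes" idea concrete, here is a small sketch that encodes the same String with different charsets and compares the byte counts. The class name `EncodingDemo` and the method `encodedLength` are invented for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
final class EncodingDemo {

    // Number of bytes the given String occupies once encoded with the charset.
    static int encodedLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String s = "\u00e9"; // LATIN SMALL LETTER E WITH ACUTE, one char in a Java String
        System.out.println(s.length());
        System.out.println(encodedLength(s, StandardCharsets.UTF_8));
        System.out.println(encodedLength(s, StandardCharsets.ISO_8859_1));
        System.out.println(encodedLength(s, StandardCharsets.UTF_16BE));
    }
}
```

The String itself is the same in every call; only the byte representation produced by each charset differs, which is why "is this String valid UTF-8?" is really a question about bytes.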
So you can test whether the string contains a colon. If it does not, URL-decode it: if the decoded string contains a colon, the original string was URL-encoded. If it still contains no colon, check whether decoding changed the string; if it did, URL-decode again, and if it did not, the string is not a valid URI.
By far the most popular character encoding today is UTF-8, which is defined as part of the Unicode standard. How quickly can we check whether a sequence of bytes is valid UTF-8? Any ASCII string is a valid UTF-8 string. An ASCII character is simply a byte value in [0, 127], or [0x00, 0x7F] in hexadecimal.
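Since any all-ASCII input is already valid UTF-8, a validator can take a cheap fast path before doing a full check. The following is a minimal sketch under that assumption, using the JDK's strict `CharsetDecoder` for the slow path; the class name `Utf8Validator` and its method names are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
final class Utf8Validator {

    // True if every byte is in [0x00, 0x7F]. Java bytes are signed,
    // so values 0x80..0xFF show up as negative numbers.
    static boolean isAscii(byte[] bytes) {
        for (byte b : bytes) {
            if (b < 0) {
                return false;
            }
        }
        return true;
    }

    // Fast path for pure ASCII, otherwise a strict decode that reports
    // malformed input instead of silently replacing it.
    static boolean isValidUtf8(byte[] bytes) {
        if (isAscii(bytes)) {
            return true;
        }
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

A new decoder is created per call because `CharsetDecoder` instances are not safe for concurrent use; code that validates many buffers on one thread could reuse a single decoder instead.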
Only byte data can be checked. If you have already constructed a String, then it's in UTF-16 internally.
Also only byte arrays can be UTF-8 encoded.
Here is a common case of UTF-8 conversions.
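import java.io.UnsupportedEncodingException;

// "\u0048\u0065\u006C\u006C\u006F" spells "Hello" using Unicode escapes.
String myString = "\u0048\u0065\u006C\u006C\u006F World";
System.out.println(myString);

// Encode the String (UTF-16 internally) into a UTF-8 byte array.
byte[] myBytes = null;
try {
    myBytes = myString.getBytes("UTF-8");
} catch (UnsupportedEncodingException e) {
    e.printStackTrace();
    System.exit(-1);
}

// Print each encoded byte as a signed decimal value.
for (int i = 0; i < myBytes.length; i++) {
    System.out.println(myBytes[i]);
}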
If you don't know the encoding of your byte array, juniversalchardet is a library to help you detect it.