I am trying to find the length of a string when it is stored in UTF-8. I tried the following approach:
String str = "मेरा नाम";
Charset UTF8_CHARSET = Charset.forName("UTF-8");
byte[] abc = str.getBytes(UTF8_CHARSET);
int length = abc.length;
This gives me the length of the byte array, but not the number of characters in the string.
I found a website that shows both the UTF-8 string length and the byte length: https://mothereff.in/byte-counter. If my string is मेरा नाम, I should get a string length of 8 characters, not 22 bytes.
Could anyone please guide me on this?
UTF-8 is capable of encoding all 1,112,064 valid character code points in Unicode using one to four one-byte (8-bit) code units.
for (int i = 0; i < string.length(); i++) { if (string.charAt(i) != ' ') count++; } // Counts the non-space chars (UTF-16 code units) in the given string.
UTF-8 is based on 8-bit code units. Each character is encoded as 1 to 4 bytes. The first 128 Unicode code points are encoded as 1 byte in UTF-8. These code points are the same as those in ASCII CCSID 367.
UTF-8 is a superset of ASCII built on 8-bit code units: the 128 ASCII characters, printable and non-printable alike, keep their single-byte encodings, and every other code point is encoded as a sequence of two to four bytes. It is therefore not limited to 256 characters.
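To make the one-to-four-byte rule concrete, here is a minimal sketch that encodes a few sample code points and reports how many bytes each takes. The class name `Utf8Width`, the helper `utf8Bytes`, and the sample code points are illustrative choices, not part of the original post.

```java
import java.nio.charset.StandardCharsets;

class Utf8Width {
    // Number of bytes UTF-8 needs for a single Unicode code point.
    // (Hypothetical helper; it assumes a valid, non-surrogate code point.)
    static int utf8Bytes(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // 'A', 'é', Devanagari 'म', and an emoji outside the Basic Multilingual Plane.
        int[] samples = {0x41, 0xE9, 0x92E, 0x1F600};
        for (int cp : samples) {
            System.out.printf("U+%04X -> %d byte(s)%n", cp, utf8Bytes(cp));
        }
        // Prints 1, 2, 3 and 4 byte(s) respectively.
    }
}
```

This is why the byte array length in the question grows faster than the character count: every Devanagari code point costs three bytes.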
The smallest "length" is the number of Unicode code points, which matches the notion of a numbered character and equals the length in UTF-32 code units.
Correction: as @liudongmiao mentioned, one should probably use:
int length = string.codePointCount(0, string.length());
In Java 8 and later:
int length = (int) string.codePoints().count();
In earlier Java versions:
int length(String s) {
    int n = 0;
    for (int i = 0; i < s.length(); ++n) {
        int cp = s.codePointAt(i);
        i += Character.charCount(cp);
    }
    return n;
}
A Unicode code point can be encoded in UTF-16 as one or two chars.
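The difference only shows up for code points outside the Basic Multilingual Plane, which UTF-16 stores as a surrogate pair. A small sketch, with the class name `SurrogateDemo` and the emoji U+1F600 picked purely for illustration:

```java
class SurrogateDemo {
    // Number of UTF-16 code units (Java chars) in the string.
    static int chars(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String bmp = "\u092E";                                  // Devanagari 'म', inside the BMP
        String supplementary = new String(Character.toChars(0x1F600)); // outside the BMP

        System.out.println(chars(bmp) + " char(s), " + codePoints(bmp) + " code point(s)");
        // prints "1 char(s), 1 code point(s)"
        System.out.println(chars(supplementary) + " char(s), " + codePoints(supplementary) + " code point(s)");
        // prints "2 char(s), 1 code point(s)"
    }
}
```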
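The same Unicode character can be written in more than one way when it carries diacritical marks: either as a single precomposed code point, or as a base letter followed by one or more combining marks as separate code points. To normalize the string to the composed form (NFC; the C stands for composed), which uses a single code point wherever a precomposed one exists: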
string = java.text.Normalizer.normalize(string, Normalizer.Form.NFC);
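As a rough illustration of how normalization changes the code point count, the sketch below counts code points of the same text in composed (NFC) and decomposed (NFD) form. The class name `NormalizeDemo`, the helper `codePointsAfter`, and the sample "e" plus combining acute accent are assumptions made for the example.

```java
import java.text.Normalizer;

class NormalizeDemo {
    // Code point count of s after normalizing it to the given form.
    static int codePointsAfter(String s, Normalizer.Form form) {
        String normalized = Normalizer.normalize(s, form);
        return normalized.codePointCount(0, normalized.length());
    }

    public static void main(String[] args) {
        // 'e' followed by U+0301 COMBINING ACUTE ACCENT: displayed as a single "é".
        String decomposed = "e\u0301";

        System.out.println(codePointsAfter(decomposed, Normalizer.Form.NFC)); // prints 1
        System.out.println(codePointsAfter(decomposed, Normalizer.Form.NFD)); // prints 2
    }
}
```

Sequences with no precomposed equivalent stay as several code points even after NFC, so the count after normalization is still not guaranteed to equal the number of characters a reader sees.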
BTW for database purposes, the UTF-16 length seems more useful:
string.length() // Number of UTF-16 chars, every char two bytes.
(In the example mentioned UTF-32 length == UTF-16 length.)
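The three lengths can be compared side by side. This sketch assumes the question's string is मेरा नाम (written with Unicode escapes so the source file encoding does not matter); the class and method names are illustrative only.

```java
import java.nio.charset.StandardCharsets;

class LengthComparison {
    // Length in UTF-16 code units (what String.length() returns).
    static int utf16Length(String s) {
        return s.length();
    }

    // Length in Unicode code points (equal to the UTF-32 length).
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    // Length in bytes once encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // Assumed to be the question's string: "मेरा नाम".
        String str = "\u092E\u0947\u0930\u093E \u0928\u093E\u092E";

        System.out.println(utf16Length(str));     // prints 8
        System.out.println(codePointLength(str)); // prints 8
        System.out.println(utf8Length(str));      // prints 22
    }
}
```

All eight code points are in the Basic Multilingual Plane, which is why the UTF-16 and code point lengths agree here; seven of them take three bytes in UTF-8 and the space takes one.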
A dump function
A commenter got an unexpected result; the following function dumps every code point of a string to help diagnose such cases:
void dump(String s) {
    int n = 0; // Number of code points seen so far.
    for (int i = 0; i < s.length(); ++n) {
        int cp = s.codePointAt(i);
        // charCount is 1 or 2: the number of UTF-16 chars, not bytes.
        int chars = Character.charCount(cp);
        i += chars;
        System.out.printf("[%d] #%d char(s): U+%X = %s%n",
                n, chars, cp, Character.getName(cp));
    }
    System.out.printf("Length:%d%n", n);
}