Find out number of characters in a UTF-8 string in Java/Android

I am trying to find the length of a string that is stored in UTF-8. I tried the following approach:

String str = "मेरा नाम";
Charset UTF8_CHARSET = Charset.forName("UTF-8");
byte[] abc = str.getBytes(UTF8_CHARSET);
int length = abc.length;

This gives me the length of the byte array, but not the number of characters in the string.

I found a website that shows both the UTF-8 string length and the byte length: https://mothereff.in/byte-counter. For example, if my string is मेरा नाम, I should get a string length of 8 characters, not 22 bytes.

Could anyone please guide me on this?

asked Apr 19 '15 06:04

Sujit Devkar

1 Answer

The shortest "length" is the count of Unicode code points, that is, of numbered characters; it equals the length in UTF-32 code units.

Correction: as @liudongmiao mentioned, one should probably use:

int length = string.codePointCount(0, string.length());

In Java 8:

int length = (int) string.codePoints().count();

In earlier Java versions:

int length(String s) {
   int n = 0; // number of code points seen so far
   for (int i = 0; i < s.length(); ++n) {
       int cp = s.codePointAt(i);
       // Advance by 1 char, or by 2 chars for a surrogate pair.
       i += Character.charCount(cp);
   }
   return n;
}

A Unicode code point can be encoded in UTF-16 as one or two chars.
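A minimal sketch of the one-or-two-chars point, using U+1F600 as an arbitrary example of a code point outside the Basic Multilingual Plane. The class name `SurrogateDemo` and its helper are hypothetical, chosen only for illustration:

```java
// Hypothetical demo class, not part of the original answer.
class SurrogateDemo {

    // Number of Unicode code points in s.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1F600 lies outside the BMP, so UTF-16 needs a surrogate pair for it.
        String s = new String(Character.toChars(0x1F600));

        System.out.println(s.length());                   // 2 UTF-16 chars
        System.out.println(codePoints(s));                // 1 code point
        System.out.println(Character.charCount(0x1F600)); // 2
        System.out.println(Character.charCount('a'));     // 1
    }
}
```

For such strings `length()` overcounts relative to the code point count, which is why the loop above advances by `Character.charCount(cp)` rather than by 1.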

The same Unicode character may carry diacritical marks. These can be written as separate code points: a base letter followed by zero or more combining marks. To normalize the string to the composed form (the C in NFC), which merges such a sequence into a single code point where a precomposed one exists:

string = java.text.Normalizer.normalize(string, java.text.Normalizer.Form.NFC);
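To illustrate the effect on the code point count, here is a small sketch that counts code points before and after NFC normalization. The class `NormalizeDemo` and its helpers are hypothetical names; the sample string is "e" followed by U+0301 (combining acute accent):

```java
import java.text.Normalizer;

// Hypothetical demo class, not part of the original answer.
class NormalizeDemo {

    // Number of Unicode code points in s.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Canonical composition (NFC) of s.
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String decomposed = "e\u0301";       // base letter + combining acute accent
        String composed = nfc(decomposed);   // becomes U+00E9

        System.out.println(codePoints(decomposed)); // 2
        System.out.println(codePoints(composed));   // 1
    }
}
```

Normalization only helps where a precomposed code point exists. A Devanagari consonant followed by a vowel sign, as in the question's string, remains two code points after NFC.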

BTW for database purposes, the UTF-16 length seems more useful:

string.length() // Number of UTF-16 chars, every char two bytes.

(In the example mentioned UTF-32 length == UTF-16 length.)
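As a rough check of the three measures on the question's string, the sketch below prints the UTF-8 byte length, the UTF-16 length, and the code point count. The class `LengthComparison` and its method names are hypothetical, and `StandardCharsets` assumes Java 7 or later (on Android, API level 19 or later):

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original answer.
class LengthComparison {

    // Number of bytes in the UTF-8 encoding of s.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of Unicode code points in s.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // The question's string, written with escapes so that the
        // source file encoding does not matter.
        String s = "\u092E\u0947\u0930\u093E \u0928\u093E\u092E";

        System.out.println("UTF-8 bytes:  " + utf8Bytes(s));  // 22
        System.out.println("UTF-16 chars: " + s.length());    // 8
        System.out.println("Code points:  " + codePoints(s)); // 8
    }
}
```

Each of the seven Devanagari code points takes three bytes in UTF-8 and the space takes one, which gives 22 bytes; all of them fit in a single UTF-16 char, so both other counts are 8.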

A dump function

A commenter got an unexpected result. The following function prints each code point so that the string can be inspected:

void dump(String s) {
   int n = 0; // index of the current code point
   for (int i = 0; i < s.length(); ++n) {
       int cp = s.codePointAt(i);
       // charCount returns UTF-16 chars (1 or 2), not bytes.
       int chars = Character.charCount(cp);
       i += chars;
       System.out.printf("[%d] %d char(s): U+%X = %s%n",
           n, chars, cp, Character.getName(cp));
   }
   System.out.printf("Length:%d%n", n);
}

answered Sep 29 '22 19:09

Joop Eggen

Tags: java, android, utf-8
