In Android, using Java: how to determine the length of a String that mixes single-byte and double-byte characters?

Question

I have a String variable containing ASCII characters and double-byte characters (for example, Chinese or Japanese).

How do I determine the total length of the String? I also want to use the substring/replace functions on it.

McDowell · Accepted Answer

The string type in Java is implicitly UTF-16. All other encodings (e.g. UTF-8) should be represented using byte arrays.

"Length" is an ambiguous term.

Each Unicode code point takes one or two code units (16-bit chars): one for code points in the Basic Multilingual Plane, two (a surrogate pair) for code points in the supplementary ranges. When transcoded to different encodings, the number of bytes a string will consume can change. A sequence of code points can also combine to form a single user-visible grapheme.

So, here are ways to measure the "length" of a string:

number of bytes: String.getBytes(Charset).length
number of chars: String.length()
number of code points: String.codePointCount(int,int)
number of graphemes: count the boundaries reported by BreakIterator.getCharacterInstance(Locale)
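The four measures above can be compared side by side. The following is a minimal sketch, not the answer author's code; the class name StringLengths and its method names are made up for illustration, and it assumes a runtime that provides java.nio.charset.StandardCharsets.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical helper class; the names are illustrative only.
final class StringLengths {
    // Bytes needed to store the text in the given encoding.
    static int bytes(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    // UTF-16 code units (Java chars).
    static int chars(String s) {
        return s.length();
    }

    // Unicode code points; a surrogate pair counts once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // User-visible characters, counted by walking the grapheme boundaries.
    static int graphemes(String s, Locale locale) {
        BreakIterator it = BreakIterator.getCharacterInstance(locale);
        it.setText(s);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // "e" + combining acute accent (U+0301) + U+1F600 as a surrogate pair
        String s = "e\u0301\uD83D\uDE00";
        System.out.println("bytes (UTF-8): " + bytes(s, StandardCharsets.UTF_8));
        System.out.println("chars:         " + chars(s));
        System.out.println("code points:   " + codePoints(s));
        System.out.println("graphemes:     " + graphemes(s, Locale.ROOT));
    }
}
```

For text that stays inside the Basic Multilingual Plane and uses no combining marks, such as "Hello 您好世界", the char, code point and grapheme counts coincide and only the byte count depends on the encoding.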

I covered some of this in a blog post.

Comment: Is there an easy way/API to handle the mixed-byte String, i.e. to cut/shorten/substring() a string like "sDDsssDDDDsDD" (s: single-byte ASCII character, DD: double-byte character)?

Consider the Java string literal "Hello 您好世界" which can also be expressed as "Hello \u60a8\u597d\u4e16\u754c".

This could be encoded in the legacy Windows Simplified Chinese double-byte encoding as the byte sequence:

48 65 6c 6c 6f 20 c4 fa ba c3 ca c0 bd e7

In order to turn this into Java characters, you would decode it:

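import java.nio.charset.Charset;

// "Hello 您好世界" as bytes in the Windows Simplified Chinese (code page 936) encoding
byte[] data = { 0x48, 0x65, 0x6c, 0x6c, 0x6f, 0x20, (byte) 0xc4,
    (byte) 0xfa, (byte) 0xba, (byte) 0xc3, (byte) 0xca, (byte) 0xc0,
    (byte) 0xbd, (byte) 0xe7 };
// forName throws UnsupportedCharsetException if the runtime lacks this charset
Charset encoding = Charset.forName("x-mswin-936");
String hello = new String(data, encoding);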

Now that you've transcoded the data to Unicode, you can use the usual string manipulation mechanisms (substring, regex matching, etc.).
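Once the text is a Java String, the single-byte/double-byte distinction of the source encoding is gone: each of the Chinese characters in the example is one char, so plain substring(int,int) cuts it correctly. The only remaining trap is supplementary code points, which occupy two chars. As a rough sketch (the class and method names here are hypothetical), a substring that counts in code points rather than chars could look like this:

```java
// Hypothetical helper; not part of the JDK or the Android SDK.
final class CodePointSubstring {
    // Returns the code points in [beginCp, endCp) without ever splitting
    // a surrogate pair. Throws IndexOutOfBoundsException if out of range.
    static String substringByCodePoints(String s, int beginCp, int endCp) {
        int begin = s.offsetByCodePoints(0, beginCp);
        int end = s.offsetByCodePoints(begin, endCp - beginCp);
        return s.substring(begin, end);
    }

    public static void main(String[] args) {
        String hello = "Hello \u60a8\u597d\u4e16\u754c";
        // First seven characters: "Hello " plus the first Chinese character.
        System.out.println(substringByCodePoints(hello, 0, 7));
        // replace works on the decoded text like on any other String.
        System.out.println(hello.replace("\u4e16\u754c", "World"));
    }
}
```

For strings without supplementary characters this behaves exactly like String.substring, so it only matters if the input may contain emoji or rare CJK ideographs.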

Note that you must know the double-byte encoding you use before transformation. If you don't know the encoding, all you have is junk.
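To illustrate why the encoding has to be known, here is a small sketch that decodes the same bytes twice, once with the right charset and once with a wrong one. It uses UTF-8 and ISO-8859-1 as stand-ins for the double-byte encoding above, only because those two are guaranteed to exist on every Java platform; the class name is made up.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; UTF-8 and ISO-8859-1 stand in for the
// double-byte encoding discussed in the answer.
final class WrongCharsetDemo {
    static String decode(byte[] data, Charset cs) {
        return new String(data, cs);
    }

    public static void main(String[] args) {
        String original = "\u60a8\u597d"; // two Chinese characters
        byte[] data = original.getBytes(StandardCharsets.UTF_8); // six bytes

        String right = decode(data, StandardCharsets.UTF_8);
        String wrong = decode(data, StandardCharsets.ISO_8859_1);

        System.out.println(right.equals(original)); // same text comes back
        System.out.println(wrong.length());         // one char per byte: junk
    }
}
```

The wrongly decoded string is still a valid String, so nothing fails at runtime; the text is simply garbage, and any length or substring computed on it is meaningless.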

I don't know what encodings Android supports, but you can discover this at runtime by calling Charset.availableCharsets(). If Android doesn't support an encoding you need, have a look at the ICU4J library.
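A runtime check along these lines might look like the sketch below. Charset.isSupported and Charset.availableCharsets are standard java.nio.charset calls; the class name, the helper firstSupported and the list of candidate names are assumptions made for the example.

```java
import java.nio.charset.Charset;

// Hypothetical helper for probing which charsets the runtime offers.
final class CharsetProbe {
    // Returns the first supported charset among the candidates, or null.
    static Charset firstSupported(String... names) {
        for (String name : names) {
            if (Charset.isSupported(name)) {
                return Charset.forName(name);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // List everything this runtime knows about.
        for (String name : Charset.availableCharsets().keySet()) {
            System.out.println(name);
        }
        // Try a few names commonly used for Simplified Chinese (assumed list).
        Charset cs = firstSupported("x-mswin-936", "GBK", "GB2312");
        System.out.println(cs == null ? "no match" : cs.name());
    }
}
```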

Stephen C · Answer

As others have said, Java Strings are conceptually read-only arrays of Java characters, and the "length" of a String is the number of characters. However, there are complicating issues:

A Java character is not necessarily what you think of as a character. In particular, there are more Unicode characters (code-points) than can be represented using Java characters. Some Unicode code-points require two Java characters to represent them. (This is the "extended plane" issue that Thilo refers to.)
Some JVMs (with the appropriate JVM flags set at startup) will use a String representation where the characters are encoded in UTF-8. While the length of the String is the same (in this case, the number of Java characters represented by the UTF-8), the memory used can be significantly less.
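The first point can be made concrete by walking a string one code point at a time instead of one Java character at a time. This is only a sketch with a made-up class name; it relies on String.codePointAt and Character.charCount, which have been part of the String and Character APIs since Java 5.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that lists the code points of a string.
final class CodePointWalk {
    static List<Integer> codePointsOf(String s) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);     // reads one or two chars
            out.add(cp);
            i += Character.charCount(cp);  // advance by 1 or 2 chars
        }
        return out;
    }

    public static void main(String[] args) {
        // "a", U+1F600 (two Java chars), "b": three code points, four chars
        String s = "a\uD83D\uDE00b";
        System.out.println(s.length());
        System.out.println(codePointsOf(s).size());
    }
}
```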

Then there is the question of how many bytes are required to represent the String's characters as UTF-8, or in some other encoding. As far as I know, the only JVM provided way to find that out is to do the conversion; e.g. using getBytes(charSet).
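As a sketch of both options for UTF-8 specifically: the first method below does the conversion as described, and the second counts bytes by hand from the UTF-8 encoding rules without allocating a byte array. The second is not a JVM-provided facility, the class and method names are invented for the example, and it assumes the string contains no unpaired surrogates (getBytes replaces those, so the two counts would differ).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper; assumes well-formed UTF-16 input.
final class Utf8Length {
    // The conversion approach: encode, then measure.
    static int byConversion(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Hand-rolled alternative: apply the UTF-8 size rule per code point.
    static int byCounting(String s) {
        int total = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp < 0x80) {
                total += 1;       // ASCII
            } else if (cp < 0x800) {
                total += 2;
            } else if (cp < 0x10000) {
                total += 3;       // rest of the BMP, including most CJK
            } else {
                total += 4;       // supplementary planes
            }
            i += Character.charCount(cp);
        }
        return total;
    }
}
```

For other encodings there is no comparably simple rule, so doing the conversion remains the practical route.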

Finally, there is the question of how many bytes a String occupies in the heap. You can find out how many bytes are in the String object and its associated char[] backing object. However, predicting what that is going to be can be tricky, when you consider that substring and other String methods can create sets of strings that share a single backing array.

Tags:

java

android

cmh
