I have a String variable containing both ASCII characters and double-byte characters (for example, Chinese or Japanese).
How do I determine the total length of the String? I also want to use the substring/replace functions on it.
The string type in Java is implicitly UTF-16. All other encodings (e.g. UTF-8) should be represented using byte arrays.
"Length" is an ambiguous term.
Each Unicode code point consumes one or two code units (16-bit chars): one for code points in the Basic Multilingual Plane, two (a surrogate pair) for code points in the supplementary ranges. When transcoded to different encodings, the number of bytes a string consumes can change. A sequence of code points can also combine to form a single user-visible grapheme.
So, here are ways to measure the "length" of a string:
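As a rough sketch of three of these measures (UTF-16 code units, code points, and graphemes), something like the following could be used. The class name StringLengths and the helper graphemeCount are made up for illustration, and BreakIterator only approximates user-perceived characters; its exact behaviour depends on the JDK version and locale data.

```java
import java.text.BreakIterator;

public class StringLengths {
    // Hypothetical helper: counts character boundaries reported by BreakIterator,
    // which approximates the number of user-visible graphemes.
    static int graphemeCount(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // 'e' + U+0301 COMBINING ACUTE ACCENT, then U+1D11E (a supplementary code point)
        String s = "e\u0301\uD834\uDD1E";

        // UTF-16 code units (Java chars): 4
        System.out.println(s.length());
        // Unicode code points: 3
        System.out.println(s.codePointCount(0, s.length()));
        // Approximate grapheme count
        System.out.println(graphemeCount(s));
    }
}
```

Which of these is the "right" length depends on what you need it for: array indexing, counting characters, or what the user sees on screen.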
I covered some of this in a blog post.
Comment: Is there an easy way or API to handle a mixed-width String, i.e. to cut, shorten, or substring() a string like "sDDsssDDDDsDD" (s: single-byte ASCII character, DD: double-byte character)?
Consider the Java string literal "Hello 您好世界" which can also be expressed as "Hello \u60a8\u597d\u4e16\u754c".
This could be encoded in the legacy Windows Simplified Chinese double-byte encoding as the byte sequence:
48 65 6c 6c 6f 20 c4 fa ba c3 ca c0 bd e7
In order to turn this into Java characters, you would decode it:
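import java.nio.charset.Charset;

// The bytes above; values over 0x7f need a cast because Java bytes are signed
byte[] data = { 0x48, 0x65, 0x6c, 0x6c, 0x6f, 0x20, (byte) 0xc4,
(byte) 0xfa, (byte) 0xba, (byte) 0xc3, (byte) 0xca, (byte) 0xc0,
(byte) 0xbd, (byte) 0xe7 };
// Throws UnsupportedCharsetException if the runtime lacks this charset
Charset encoding = Charset.forName("x-mswin-936");
String hello = new String(data, encoding);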
Now that you've transcoded the data to Unicode, you can use the usual string manipulation mechanisms (substring, regex matching, etc.).
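For the cut/shorten case asked about in the comment, here is a minimal sketch. Once the data is a Java String, ASCII and Chinese characters both sit at ordinary char indices, so substring and replace work directly. The helper substringByCodePoints is hypothetical (not a JDK method) and is only needed if the text may contain supplementary characters, where a char index could land in the middle of a surrogate pair.

```java
public class CodePointSubstring {
    // Hypothetical helper: substring addressed by code point positions
    // instead of char positions, so surrogate pairs are never split.
    static String substringByCodePoints(String s, int beginCp, int endCp) {
        int begin = s.offsetByCodePoints(0, beginCp);
        int end = s.offsetByCodePoints(begin, endCp - beginCp);
        return s.substring(begin, end);
    }

    public static void main(String[] args) {
        String hello = "Hello \u60a8\u597d\u4e16\u754c";

        // Plain char indices are fine here: every character is in the BMP.
        System.out.println(hello.substring(6, 8));
        System.out.println(hello.replace("\u4e16\u754c", "World"));

        // U+1D11E occupies two chars; address it as a single code point.
        String clef = "a\uD834\uDD1Eb";
        System.out.println(substringByCodePoints(clef, 1, 2));
    }
}
```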
Note that you must know which double-byte encoding the data uses before you decode it. If you don't know the encoding, all you have is junk.
I don't know what encodings Android supports, but you can discover this at runtime by calling Charset.availableCharsets(). If Android doesn't support an encoding you need, have a look at the ICU4J library.
As others have said, Java Strings are conceptually read-only arrays of Java characters, and the "length" of a String is the number of characters. However, there are complicating issues:
A Java character is not necessarily what you think of as a character. In particular, there are more Unicode characters (code-points) than can be represented using Java characters. Some Unicode code-points require two Java characters to represent them. (This is the "extended plane" issue that Thilo refers to.)
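To see the difference between Java characters and code points, one option is to walk the string by code point (Java 8 or later). This is only an illustrative sketch; the class name CodePointWalk and the helper supplementaryCount are invented for the example.

```java
public class CodePointWalk {
    // Hypothetical helper: how many code points in s need two Java chars.
    static long supplementaryCount(String s) {
        return s.codePoints()
                .filter(Character::isSupplementaryCodePoint)
                .count();
    }

    public static void main(String[] args) {
        // 'A', U+4E16, and U+1F600 (outside the Basic Multilingual Plane)
        String s = "A\u4e16\uD83D\uDE00";

        // Print each code point with the number of chars it occupies.
        s.codePoints().forEach(cp ->
                System.out.printf("U+%04X uses %d char(s)%n", cp, Character.charCount(cp)));

        System.out.println(supplementaryCount(s));
    }
}
```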
Some JVMs can use a more compact internal String representation. Older HotSpot releases had an opt-in startup flag for this; as far as I know, since Java 9 "compact strings" are on by default and store strings containing only Latin-1 characters as one byte per character (this is not UTF-8). The length of the String is unchanged (it is still the number of Java characters represented), but the memory used can be significantly less.
Then there is the question of how many bytes are required to represent the String's characters as UTF-8, or in some other encoding. As far as I know, the only JVM-provided way to find that out is to do the conversion, e.g. using getBytes(charset).
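A small sketch of that conversion approach, with a made-up helper name encodedLength. Note that getBytes(Charset) silently substitutes a replacement byte sequence for characters the charset cannot represent, so the count is only meaningful for charsets that can encode the whole string.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodedLength {
    // Hypothetical helper: encode the string and count the resulting bytes.
    static int encodedLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String hello = "Hello \u60a8\u597d\u4e16\u754c";

        // 6 ASCII characters at 1 byte each + 4 Chinese characters at 3 bytes each = 18
        System.out.println(encodedLength(hello, StandardCharsets.UTF_8));
        // 10 code units at 2 bytes each = 20 (UTF_16BE writes no byte order mark)
        System.out.println(encodedLength(hello, StandardCharsets.UTF_16BE));
    }
}
```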
Finally, there is the question of how many bytes a String occupies in the heap. You can find out how many bytes are in the String object and its associated backing array. However, predicting what that is going to be can be tricky: in older JVMs (before Java 7 update 6), substring and other String methods could create sets of strings that shared a single backing array, and newer JVMs may store the characters in a more compact form.