I know there is String#length and the various methods in Character, which more or less work on code units / code points. What is the suggested way in Java to get a character count as specified by the Unicode standard (UAX #29), taking things like language/locale, normalization and grapheme clusters into account?
String.length() is specified as returning the number of char values ("code units") in the String. That is the most generally useful definition of the length of a Java String; see below.
Your description [1] of the semantics of length based on the size of the backing array/array slice is incorrect. The fact that the value returned by length() is also the size of the backing array or array slice is merely an implementation detail of typical Java class libraries. String does not need to be implemented that way. Indeed, I think I've seen Java String implementations where it WASN'T implemented that way.
To get the number of Unicode code points in a String, use str.codePointCount(0, str.length()) -- see the javadoc.
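For example, continuing the sketch above:

    String clef = "\uD834\uDD1E";                                 // one G clef character
    System.out.println(clef.length());                            // 2 code units
    System.out.println(clef.codePointCount(0, clef.length()));    // 1 code point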
To get the size (in bytes) of a String in a specific encoding (i.e. charset), use str.getBytes(charset).length [2].
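For instance (a small sketch using charsets from java.nio.charset.StandardCharsets):

    import java.nio.charset.StandardCharsets;

    String s = "héllo";                                               // 'é' takes 2 bytes in UTF-8
    int utf8Size = s.getBytes(StandardCharsets.UTF_8).length;        // 6
    int utf16Size = s.getBytes(StandardCharsets.UTF_16BE).length;    // 10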
To deal with locale-specific issues, you can use Normalizer to normalize the String to whatever form is most appropriate to your use-case, and then use codePointCount as above. But in some cases even this won't work; e.g. the Hungarian letter-counting rules, which the Unicode standard apparently doesn't cater for.
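A sketch of that approach (assuming NFC is the appropriate form for the use-case; java.text.Normalizer is part of the standard library):

    import java.text.Normalizer;

    String decomposed = "e\u0301";   // 'e' + COMBINING ACUTE ACCENT
    String nfc = Normalizer.normalize(decomposed, Normalizer.Form.NFC);     // composed "é"
    System.out.println(decomposed.codePointCount(0, decomposed.length()));  // 2
    System.out.println(nfc.codePointCount(0, nfc.length()));                // 1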
The reason that most applications use String.length() is that most applications are not concerned with counting the number of characters in words, texts, etcetera in a human-centric way. For instance, if I do this:

    String s = "hi mum how are you";
    int pos = s.indexOf("mum");
    String textAfterMum = s.substring(pos + "mum".length());
it really doesn't matter that "mum".length() is not returning code points, or that it is not a linguistically correct character count. It is measuring the length of the string using the model that is appropriate to the task at hand. And it works.
Obviously, things get a bit more complicated when you do multilingual text analysis; e.g. searching for words. But even then, if you normalize your text and parameters before you start, you can safely code in terms of "code units" rather than "code points" most of the time; i.e. length() still works.
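A sketch of that idea, normalizing both the text and the search term to the same form (NFC here, purely as an example) before searching:

    import java.text.Normalizer;

    String rawText = "re\u0301sume\u0301 draft";   // accents in decomposed form
    String text = Normalizer.normalize(rawText, Normalizer.Form.NFC);
    String word = Normalizer.normalize("résumé", Normalizer.Form.NFC);
    int pos = text.indexOf(word);                  // 0 -- found
    // Without normalizing both sides first, indexOf would return -1 here,
    // because composed and decomposed forms compare as different char sequences.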
[1] This description was on some versions of the question. See the edit history ... if you have sufficient rep points.
[2] Using str.getBytes(charset).length entails doing the encoding and then throwing the result away. There is possibly a general way to do this without that copy. It would entail wrapping the String as a CharBuffer, creating a custom ByteBuffer with no backing to act as a byte counter, and then using CharsetEncoder.encode(...) to count the bytes. Note: I have not tried this, and I would not recommend trying it unless you have clear evidence that getBytes(charset) is a significant performance bottleneck.
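A sketch along those lines, with one substitution: ByteBuffer cannot be subclassed outside java.nio, so instead of a custom no-backing buffer this version drains a small scratch buffer and tallies as it goes. It is only a sketch of the idea above, and the untested-optimization caveat still applies:

    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetEncoder;
    import java.nio.charset.CoderResult;
    import java.nio.charset.StandardCharsets;

    public class EncodedLength {
        // Counts the bytes that 's' would occupy in 'charset' without
        // materializing the whole byte array.
        static long encodedLength(String s, Charset charset)
                throws CharacterCodingException {
            CharsetEncoder enc = charset.newEncoder();
            CharBuffer in = CharBuffer.wrap(s);
            ByteBuffer scratch = ByteBuffer.allocate(1024);
            long count = 0;
            CoderResult cr;
            // Encode until all input is consumed. OVERFLOW just means the
            // scratch buffer filled up: tally its contents and reuse it.
            do {
                cr = enc.encode(in, scratch, true);
                if (cr.isError()) {
                    cr.throwException();
                }
                count += scratch.position();
                scratch.clear();
            } while (cr.isOverflow());
            // Some encoders emit trailing bytes when flushed.
            do {
                cr = enc.flush(scratch);
                count += scratch.position();
                scratch.clear();
            } while (cr.isOverflow());
            return count;
        }

        public static void main(String[] args) throws CharacterCodingException {
            System.out.println(encodedLength("héllo", StandardCharsets.UTF_8)); // 6
        }
    }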