
How can I iterate through the unicode codepoints of a Java String?

So I know about String#codePointAt(int), but it's indexed by the char offset, not by the codepoint offset.

I'm thinking about trying something like:

  • using String#charAt(int) to get the char at an index
  • testing whether the char is in the high-surrogates range
    • if so, use String#codePointAt(int) to get the codepoint, and increment the index by 2
    • if not, use the given char value as the codepoint, and increment the index by 1
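The steps above can be sketched roughly as follows. This is only an illustration: the class and method names are hypothetical, and the extra check that a low surrogate actually follows is an assumption added here so that an unpaired high surrogate at the end of the string does not cause the index to skip past the end.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, named here only for illustration.
final class ManualSurrogateScan {

    // Walks the string char by char, following the steps listed above.
    static List<Integer> codePointsOf(String s) {
        List<Integer> result = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            // Assumption beyond the listed steps: only treat c as the start of
            // a pair when a low surrogate really follows it.
            if (Character.isHighSurrogate(c)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                result.add(s.codePointAt(i)); // combines the pair into one code point
                i += 2;
            } else {
                result.add((int) c); // BMP char, or an unpaired surrogate passed through as-is
                i += 1;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // 'a', then U+1D11E stored as a surrogate pair, then 'b'
        for (int cp : codePointsOf("a\uD834\uDD1Eb")) {
            System.out.println(Integer.toHexString(cp));
        }
    }
}
```

The surrogate range is reserved, so a char in that range only shows up as one half of a pair or as an unpaired leftover; the sketch passes the latter through unchanged.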

But my concerns are

  • I'm not sure whether codepoints which are naturally in the high-surrogates range will be stored as two char values or one
  • this seems like an awfully expensive way to iterate through characters
  • someone must have come up with something better.
asked Oct 06 '09 20:10 by rampion



1 Answer

Yes, Java uses a UTF-16-esque encoding for the internal representation of Strings, and, yes, it encodes characters outside the Basic Multilingual Plane (BMP) as surrogate pairs, that is, two char values per code point.
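As a small illustration of that point, the sketch below compares the char count with the code point count for a single character outside the BMP (U+1D11E, written with its two surrogate escapes). The class and method names are hypothetical.

```java
// Hypothetical demo class, named here only for illustration.
final class SurrogatePairDemo {

    // Reports how many char values and how many code points a string holds.
    static String summarize(String s) {
        int chars = s.length();                          // counts UTF-16 code units
        int codePoints = s.codePointCount(0, s.length()); // counts code points
        return "chars=" + chars + " codePoints=" + codePoints;
    }

    public static void main(String[] args) {
        String clef = "\uD834\uDD1E"; // one code point, U+1D11E, stored as two chars
        System.out.println(summarize(clef)); // prints chars=2 codePoints=1
        System.out.println(Character.isHighSurrogate(clef.charAt(0))); // prints true
        System.out.println(Character.isLowSurrogate(clef.charAt(1)));  // prints true
    }
}
```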

If you know you'll be dealing with characters outside the BMP, then here is the canonical way to iterate over the characters of a Java String:

    final int length = s.length();
    for (int offset = 0; offset < length; ) {
        final int codepoint = s.codePointAt(offset);

        // do something with the codepoint

        offset += Character.charCount(codepoint);
    }
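For a runnable version, here is a minimal sketch that wraps the loop above in a method and, for comparison, does the same with `String#codePoints()`, which is available from Java 8 onward. The class and method names are hypothetical, and collecting into a list is just one way to "do something" with each code point.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical demo class, named here only for illustration.
final class CodePointLoopDemo {

    // The offset-based loop, collecting each code point into a list.
    static List<Integer> withLoop(String s) {
        List<Integer> result = new ArrayList<>();
        final int length = s.length();
        for (int offset = 0; offset < length; ) {
            final int codepoint = s.codePointAt(offset);
            result.add(codepoint);
            offset += Character.charCount(codepoint); // 1 for BMP, 2 for a surrogate pair
        }
        return result;
    }

    // Alternative assuming Java 8 or later: stream the code points directly.
    static List<Integer> withStream(String s) {
        return s.codePoints().boxed().collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String s = "a\uD834\uDD1Eb"; // three code points stored in four chars
        System.out.println(withLoop(s).equals(withStream(s))); // prints true
    }
}
```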
answered Oct 03 '22 17:10 by Jonathan Feinberg