
Java: Multibyte string length

Tags:

java

I have a method which prints "header text" for command line programs, much like the syntax of Markdown:

1. =======================
2. This is a header string
3. =======================

This method takes a char c for lines 1 and 3 and repeats it n times, where n is the length of the header string s.
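A minimal sketch of such a method might look like the following. The names `HeaderPrinter` and `header` are hypothetical and only illustrate the approach described above; they are not taken from the original code.

```java
final class HeaderPrinter {
    // Builds the three-line header: rule, text, rule.
    // HeaderPrinter and header are illustrative names, not from the original program.
    static String header(String s, char c) {
        // n comes from String.length(), i.e. the number of UTF-16 code units in s.
        int n = s.length();
        StringBuilder rule = new StringBuilder(n);
        for (int i = 0; i < n; i++) {
            rule.append(c);
        }
        return rule + "\n" + s + "\n" + rule;
    }

    public static void main(String[] args) {
        System.out.println(header("This is a header string", '='));
    }
}
```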

String.length() works fine with the English alphabet, but how can I find the length (the visual length, that is) of a string containing foreign multibyte characters like "Å" and "Ç"?

josocblaugrana asked Oct 03 '12 15:10


2 Answers

String.length will be fine for those sorts of characters, as Java strings work in UTF-16, which is sufficient to represent the vast majority of characters in common use (Latin, Greek, Arabic, Hebrew, Chinese, Thai, Devanagari, ...).

If you might need to deal with characters above U+FFFF then you need to use codePointCount instead of length to cope with surrogate pairs.
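To illustrate the difference, here is a small sketch comparing the two counts. The class name `LengthDemo` and the helper `codePoints` are made up for this example; `String.length()` and `String.codePointCount(int, int)` are the standard methods.

```java
final class LengthDemo {
    // Counts Unicode code points rather than UTF-16 code units.
    // (Hypothetical helper wrapping String.codePointCount.)
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String latin = "\u00C5\u00C7";  // "ÅÇ", both below U+FFFF
        String clef = "\uD834\uDD1E";   // U+1D11E MUSICAL SYMBOL G CLEF, above U+FFFF

        System.out.println(latin.length());     // 2
        System.out.println(codePoints(latin));  // 2
        System.out.println(clef.length());      // 2 (one surrogate pair)
        System.out.println(codePoints(clef));   // 1
    }
}
```

For a header rule, this would suggest sizing the line with `codePointCount` if characters above U+FFFF may appear in the text.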

Ian Roberts answered Sep 18 '22 01:09


String.length() is fine for most Unicode characters including Å and Ç.

A Java string is UTF-16 encoded, where each character (code point) takes up 2 or 4 bytes.

Supplementary characters are the ones that take 4 bytes. Each is represented by a pair of char values (a surrogate pair), in which case codePointCount must be used instead of length.
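As a rough illustration of the 2-byte versus 4-byte distinction, the sketch below encodes a string as UTF-16 and checks for a surrogate pair. `Utf16Demo` and `utf16Bytes` are hypothetical names for this example. The byte counts refer to the UTF-16 encoded form and not necessarily to how a particular JVM stores strings internally.

```java
import java.nio.charset.StandardCharsets;

final class Utf16Demo {
    // Number of bytes the string occupies when encoded as UTF-16
    // (big-endian, so no byte-order mark is added to the count).
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String ring = "\u00C5";                                // Å, a single char
        String clef = new String(Character.toChars(0x1D11E));  // supplementary character

        System.out.println(utf16Bytes(ring));  // 2
        System.out.println(utf16Bytes(clef));  // 4

        // The supplementary character is stored as a high/low surrogate pair.
        System.out.println(Character.isSurrogatePair(clef.charAt(0), clef.charAt(1)));  // true
        System.out.println(Character.isSupplementaryCodePoint(clef.codePointAt(0)));    // true
    }
}
```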

Characters such as Å and Ç, though, most certainly exist in the standard Unicode range that fits in a single char, so length() counts each of them once.

Johan Sjöberg answered Sep 21 '22 01:09