Unicode-correct title case in Java

I've looked through the bazillion StackOverflow questions about capitalizing a word in Java. None of them seems to care in the least about internationalization, and in fact none really works in an international context. So here is my question.

I have a String in Java that represents a single word: every character satisfies isLetter(), and there is no whitespace. I want to make the first character upper case and the rest lower case. I have the word's locale at hand.

It's easy enough to call .substring(1).toLowerCase(Locale) for the last part of my string. I have no idea how to get the correct first character, though.

The first problem is Dutch, where "ij" is a digraph and both letters should be capitalized together. I could special-case this by hand because I know about it, but there may be other languages with similar rules that I don't know about. I'm sure Unicode will tell me if I ask nicely, but I don't know how to ask.

Even if the above problem is solved, I'm still stuck with no proper way to handle English, Turkish and Greek, because Character supports title case but not locales, and String supports locales but not title case.

If I take the first code point and pass it to Character.toTitleCase(), this fails because the method has no locale parameter. So if the word is Turkish and its first character is "i", I'll get "I" instead of "İ", which is wrong. If I instead take a substring and call .toUpperCase(Locale), this fails because the result is upper case, not title case. So if the word is Greek, I'll still get the wrong character.
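
A small sketch makes the gap concrete. It uses only Character.toTitleCase(int) and String.toUpperCase(Locale) from the JDK; the class and helper names (CaseMappingGap, titleFirst, upperFirst) are made up for illustration, and U+01C6 (the "dž" digraph) stands in as a character whose title-case and upper-case forms differ.

```java
import java.util.Locale;

final class CaseMappingGap {

    // Title-cases the first code point via Character: there is no Locale parameter.
    static String titleFirst(String word) {
        int cp = word.codePointAt(0);
        return new String(Character.toChars(Character.toTitleCase(cp)));
    }

    // Upper-cases the first code point via String: locale-aware, but upper case only.
    static String upperFirst(String word, Locale locale) {
        int end = Character.charCount(word.codePointAt(0));
        return word.substring(0, end).toUpperCase(locale);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");
        // Turkish "i": Character yields "I"; String with the Turkish locale yields U+0130.
        System.out.println(titleFirst("i").equals("I"));               // true
        System.out.println(upperFirst("i", turkish).equals("\u0130")); // true
        // U+01C6: the title-case form is U+01C5, the upper-case form is U+01C4.
        System.out.println(titleFirst("\u01C6").equals("\u01C5"));              // true
        System.out.println(upperFirst("\u01C6", Locale.ROOT).equals("\u01C4")); // true
    }
}
```

Neither call alone gives both properties: the first ignores the locale, the second ignores title case.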

If anyone has useful pointers, I'd be happy to hear them.

asked Sep 09 '11 11:09 by Jean

1 Answer

Like you, I was unable to find a suitable method in the core Java API.

However, there does seem to be a locale-sensitive string-title-case method (UCharacter#toTitleCase) in the ICU library.


Looking at the source for the relevant ICU methods (UCharacter#toTitleCase and UCaseProps#toUpperOrTitle), there don't seem to be many locale-specific special cases for title-casing, so you might be able to get away with the following:

  1. Find the first cased character in the string.
  2. If it has a title-case form distinct from its upper-case form, use that.
  3. Otherwise, perform a locale-sensitive upper-case on that first character and its combining characters.
  4. Perform a locale-sensitive lower-case on the rest of the string.
  5. If the locale is Dutch and the first cased character is an "I" followed by a "j", upper-case the "j".
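
The steps above can be sketched in plain Java roughly as follows. This is an illustrative reading of the list, not ICU's implementation: the class and method names are hypothetical, the isCased helper only approximates Unicode's Cased property, and "a title-case form distinct from its upper-case form" is interpreted here as comparing Character.toTitleCase with the locale-neutral String.toUpperCase result.

```java
import java.util.Locale;

final class TitleCaseSketch {

    // Assumption: approximates the Unicode "Cased" property with Character predicates.
    static boolean isCased(int cp) {
        return Character.isUpperCase(cp) || Character.isLowerCase(cp) || Character.isTitleCase(cp);
    }

    static boolean isCombiningMark(int cp) {
        int type = Character.getType(cp);
        return type == Character.NON_SPACING_MARK
            || type == Character.ENCLOSING_MARK
            || type == Character.COMBINING_SPACING_MARK;
    }

    static String titleCaseWord(String word, Locale locale) {
        // Step 1: find the first cased code point.
        int start = 0;
        while (start < word.length() && !isCased(word.codePointAt(start))) {
            start += Character.charCount(word.codePointAt(start));
        }
        if (start == word.length()) {
            return word; // nothing to case
        }
        int cp = word.codePointAt(start);
        int end = start + Character.charCount(cp);

        String first = word.substring(start, end);
        String title = new String(Character.toChars(Character.toTitleCase(cp)));
        String head;
        if (!title.equals(first.toUpperCase(Locale.ROOT))) {
            // Step 2: a distinct title-case form exists, so use it.
            head = title;
        } else {
            // Step 3: locale-sensitive upper case of the character plus its combining marks.
            while (end < word.length() && isCombiningMark(word.codePointAt(end))) {
                end += Character.charCount(word.codePointAt(end));
            }
            head = word.substring(start, end).toUpperCase(locale);
        }

        // Step 4: locale-sensitive lower case for the rest of the word.
        String rest = word.substring(end).toLowerCase(locale);

        // Step 5: Dutch "ij" digraph, capitalized as "IJ".
        if ("nl".equals(locale.getLanguage()) && head.equals("I") && rest.startsWith("j")) {
            rest = "J" + rest.substring(1);
        }
        return word.substring(0, start) + head + rest;
    }

    public static void main(String[] args) {
        System.out.println(titleCaseWord("ijsselmeer", Locale.forLanguageTag("nl"))); // IJsselmeer
        System.out.println(titleCaseWord("ISTANBUL", Locale.ENGLISH));                // Istanbul
        // With the Turkish locale the result starts with U+0130 (dotted capital I).
        System.out.println(titleCaseWord("istanbul", Locale.forLanguageTag("tr")));
    }
}
```

Under this reading, characters whose only title-case form is a multi-character special mapping (such as "ß") are left unchanged by step 2, so ICU remains the safer choice where it is available.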

answered Sep 21 '22 06:09 by Stuart Cook