What is the preferred way to compare two Java Strings lexicographically on *Unicode code points*?

For a Java program I'm writing, I have a particular need to sort strings lexicographically by Unicode code point. This is not the same as String.compareTo() when you start dealing with values outside the Basic Multilingual Plane. String.compareTo() compares strings lexicographically on 16-bit char values. To see that this is not equivalent, note that U+FD00 ARABIC LIGATURE HAH WITH YEH ISOLATED FORM is less than U+1D11E MUSICAL SYMBOL G CLEF, but the Java String object "\uFD00" for the Arabic character compares greater than the surrogate pair "\uD834\uDD1E" for the clef.
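The mismatch described above can be reproduced with a small sketch. The class name `CompareToDemo` and its constants are illustrative names, not part of any library; only `String.codePointAt()` and `String.compareTo()` from the JDK are used.

```java
public class CompareToDemo {
    // U+FD00 ARABIC LIGATURE HAH WITH YEH ISOLATED FORM (a single char)
    static final String HAH_WITH_YEH = "\uFD00";
    // U+1D11E MUSICAL SYMBOL G CLEF (a surrogate pair: two chars)
    static final String G_CLEF = "\uD834\uDD1E";

    public static void main(String[] args) {
        // Code point order: 0xFD00 is less than 0x1D11E.
        System.out.println(HAH_WITH_YEH.codePointAt(0) < G_CLEF.codePointAt(0)); // true

        // char order: 0xFD00 is greater than the high surrogate 0xD834,
        // so compareTo() reports the Arabic ligature as the greater string.
        System.out.println(HAH_WITH_YEH.compareTo(G_CLEF) > 0); // true
    }
}
```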

I can manually loop along the code points using String.codePointAt() and Character.charCount() and do the comparison myself if necessary. Is there an API function or other more "canonical" way of doing this?
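For reference, the manual loop mentioned above might look roughly like the following. This is a minimal sketch, not a canonical API: the class name `CodePointOrder` and its members are hypothetical, and it assumes that unpaired surrogates should simply compare by their own value, which is what `String.codePointAt()` returns for them.

```java
import java.util.Comparator;

public final class CodePointOrder {
    // Hypothetical convenience constant so the comparison can be passed to sort().
    public static final Comparator<String> COMPARATOR = CodePointOrder::compare;

    private CodePointOrder() {}

    public static int compare(String a, String b) {
        int i = 0;
        int limit = Math.min(a.length(), b.length());
        while (i < limit) {
            int cpA = a.codePointAt(i);
            int cpB = b.codePointAt(i);
            if (cpA != cpB) {
                return Integer.compare(cpA, cpB);
            }
            // Equal code points occupy the same number of chars in both strings,
            // so a single index stays aligned for both.
            i += Character.charCount(cpA);
        }
        // One string is a prefix of the other; the shorter one sorts first.
        return Integer.compare(a.length(), b.length());
    }
}
```

With this sketch, `list.sort(CodePointOrder.COMPARATOR)` would place "\uFD00" before "\uD834\uDD1E", the opposite of the natural `String` ordering.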

Aaron Rotenberg asked Dec 09 '14 17:12


People also ask

Which method is used to compare two strings lexicographically?

The compareTo() method compares two strings lexicographically. The comparison is based on the 16-bit char (UTF-16 code unit) value of each character in the strings, not on full Unicode code points. The method returns 0 if the two strings are equal.

How do you compare two strings in Java?

Using String.equals(): In Java, the equals() method compares two strings by content. It returns true if both strings contain the same sequence of characters, and false otherwise.

What does it mean to compare strings lexicographically?

Two strings are lexicographically equal if they are the same length and contain the same characters in the same positions.


1 Answer

It's called collation. See https://docs.oracle.com/javase/tutorial/i18n/text/locale.html

Note that your database can also sort query results using collations. See, for example, what MySQL supports: https://dev.mysql.com/doc/refman/5.0/en/charset-charsets.html

jorgeu answered Oct 07 '22 15:10