TLDR
Java uses two chars (a surrogate pair) to represent a UTF-16 character outside the Basic Multilingual Plane. Sorting the char[] with Arrays.sort splits those pairs and scrambles the characters. Should I convert char[] to int[] or is there a better way?
Details
Java represents strings as UTF-16. But the Character class itself wraps a char (16 bits). A character outside the Basic Multilingual Plane, such as an emoji, is stored as two chars (32 bits), known as a surrogate pair.
Sorting a string of such characters with the built-in sort corrupts the data, because each char is sorted on its own and the pairs get separated. (Arrays.sort uses dual-pivot quicksort for primitives, and Collections.sort uses Arrays.sort to do the heavy lifting.)
To be specific, do you convert char[] to int[] or is there a better way to sort?
import java.util.Arrays;

public class Main {
    public static void main(String[] args) {
        int[] utfCodes = {128513, 128531, 128557};
        String emojis = new String(utfCodes, 0, 3);
        System.out.println("Initial String: " + emojis);
        char[] chars = emojis.toCharArray();
        Arrays.sort(chars);
        System.out.println("Sorted String: " + new String(chars));
    }
}
Output:
Initial String: 😁😓😭
Sorted String: ??😁??
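To see why the output is garbled, it can help to print the raw UTF-16 code units before and after the sort. The following is only an illustrative sketch; the class name SurrogateDemo and the helper hex are made up for this example. All three emojis share the same high surrogate (D83D), so sorting the chars individually pushes the high surrogates to the front and leaves most of them unpaired.

```java
import java.util.Arrays;

class SurrogateDemo {
    // Hypothetical helper: formats each UTF-16 code unit as four hex digits.
    static String hex(char[] chars) {
        StringBuilder sb = new StringBuilder();
        for (char c : chars) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        char[] chars = new String(new int[] {128513, 128531, 128557}, 0, 3).toCharArray();
        // Each emoji occupies two chars: a high surrogate followed by a low surrogate.
        System.out.println("Before: " + hex(chars)); // D83D DE01 D83D DE13 D83D DE2D
        Arrays.sort(chars);
        // Sorting by char value separates the pairs; only one D83D DE01 pair survives.
        System.out.println("After:  " + hex(chars)); // D83D D83D D83D DE01 DE13 DE2D
    }
}
```

The two leading D83D values and the trailing DE13 and DE2D are unpaired surrogates, which is why the console shows them as question marks.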
I looked around for a bit and couldn't find any clean ways to sort an array by groupings of two elements without the use of a library.
Luckily, the codePoints of the String are what you used to create the String itself in this example, so you can simply sort those and create a new String with the result.
public static void main(String[] args) {
    int[] utfCodes = {128531, 128557, 128513};
    String emojis = new String(utfCodes, 0, 3);
    System.out.println("Initial String: " + emojis);
    // Sort whole code points, so surrogate pairs stay together.
    int[] codePoints = emojis.codePoints().sorted().toArray();
    System.out.println("Sorted String: " + new String(codePoints, 0, 3));
}
Output:
Initial String: 😓😭😁
Sorted String: 😁😓😭
I switched the order of the characters in your example because they were already sorted.
If you are using Java 8 or later, then this is a simple way to sort the characters in a string while respecting (not breaking) multi-char codepoints:
int[] codepoints = someString.codePoints().sorted().toArray();
String sorted = new String(codepoints, 0, codepoints.length);
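If the intermediate int[] is not needed, the same idea can be wrapped in a small helper that collects the sorted code points straight into a StringBuilder. This is just one possible sketch; the class and method names (CodePointSort, sortByCodePoint) are invented for illustration and are not part of any library.

```java
class CodePointSort {
    // Hypothetical helper: returns the code points of s in ascending order.
    static String sortByCodePoint(String s) {
        return s.codePoints()
                .sorted()
                // appendCodePoint writes a surrogate pair when the code point needs one.
                .collect(StringBuilder::new,
                         StringBuilder::appendCodePoint,
                         StringBuilder::append)
                .toString();
    }

    public static void main(String[] args) {
        String emojis = new String(new int[] {128531, 128557, 128513}, 0, 3);
        System.out.println(sortByCodePoint(emojis));
    }
}
```

Note that this orders by numeric code point value, not by any locale-aware collation.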
Prior to Java 8, I think you either need to use a loop to iterate the code points in the original string, or use a 3rd-party library method.
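For the pre-Java 8 case, a minimal sketch of the loop approach might look like the following. The class and method names (CodePointSortLegacy, sortByCodePoint) are made up for illustration. It relies only on String.codePointCount, String.codePointAt and Character.charCount, which have been available since Java 5.

```java
import java.util.Arrays;

class CodePointSortLegacy {
    // Hypothetical helper: sorts a string by code point without using streams.
    static String sortByCodePoint(String s) {
        int count = s.codePointCount(0, s.length());
        int[] codePoints = new int[count];
        // i walks the chars, j walks the code points.
        for (int i = 0, j = 0; i < s.length(); j++) {
            int cp = s.codePointAt(i);
            codePoints[j] = cp;
            // Advance by 1 for a BMP character, by 2 for a surrogate pair.
            i += Character.charCount(cp);
        }
        Arrays.sort(codePoints);
        return new String(codePoints, 0, count);
    }

    public static void main(String[] args) {
        String emojis = new String(new int[] {128531, 128557, 128513}, 0, 3);
        System.out.println(sortByCodePoint(emojis));
    }
}
```

This answers the original question directly: the char[] is effectively converted to an int[] of code points, sorted, and turned back into a String.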
Fortunately, sorting the code points in a String is uncommon enough that the clunkiness and relative inefficiency of the solutions above are rarely a concern.
(When was the last time you tested for anagrams of emojis?)