Sorting the characters in a UTF-16 string in Java

TLDR

Java stores supplementary characters (such as emoji) as two chars, a UTF-16 surrogate pair. Sorting the char[] with Arrays.sort treats every char independently and splits those pairs. Should I convert char[] to int[] or is there a better way?

Details

Java encodes strings as UTF-16, but the Character class itself wraps a single char (16 bits). A character outside the Basic Multilingual Plane is therefore stored as a surrogate pair of two chars (32 bits).

Sorting a string that contains such characters with the built-in sort corrupts the data, because the surrogate pairs get split apart. (Arrays.sort on a char[] uses dual-pivot quicksort, and Collections.sort uses Arrays.sort to do the heavy lifting.)
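To make the corruption visible, it can help to print the UTF-16 code units before and after sorting. The following is only an illustrative sketch (the class name SurrogateDemo and the helper hexUnits are made up for this example): each emoji occupies a high surrogate followed by a low surrogate, and sorting the chars moves all high surrogates to the front, so only one valid pair survives.

```java
import java.util.Arrays;

public class SurrogateDemo {
    // Hypothetical helper: formats each UTF-16 code unit as four hex digits.
    public static String hexUnits(char[] units) {
        StringBuilder sb = new StringBuilder();
        for (char c : units) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // %X does not accept a char, so widen to int first.
            sb.append(String.format("%04X", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String emojis = new String(new int[] {128513, 128531, 128557}, 0, 3);
        char[] chars = emojis.toCharArray();

        // Three code points, but six chars (code units).
        System.out.println(emojis.codePointCount(0, emojis.length()) + " code points, "
                + emojis.length() + " chars");

        System.out.println(hexUnits(chars)); // D83D DE01 D83D DE13 D83D DE2D
        Arrays.sort(chars);
        System.out.println(hexUnits(chars)); // D83D D83D D83D DE01 DE13 DE2D
    }
}
```

After sorting, only the third D83D is still followed by a low surrogate (DE01), which is why exactly one emoji remains readable and the other four units are unpaired.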

To be specific, do you convert char[] to int[] or is there a better way to sort?

import java.util.Arrays;

public class Main {
    public static void main(String[] args) {
        int[] utfCodes = {128513, 128531, 128557};
        String emojis = new String(utfCodes, 0, 3);
        System.out.println("Initial String: " + emojis);

        char[] chars = emojis.toCharArray();
        Arrays.sort(chars);
        System.out.println("Sorted String: " + new String(chars));
    }
}

Output:

Initial String: 😁😓😭
Sorted String: ??😁??
asked Apr 23 '19 at 02:04 by dingy


2 Answers

I looked around for a bit and couldn't find any clean ways to sort an array by groupings of two elements without the use of a library.

Luckily, the codePoints of the String are what you used to create the String itself in this example, so you can simply sort those and create a new String with the result.

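public class Main {
    public static void main(String[] args) {
        // Code points deliberately given out of order.
        int[] utfCodes = {128531, 128557, 128513};
        String emojis = new String(utfCodes, 0, utfCodes.length);
        System.out.println("Initial String: " + emojis);

        // Sort whole code points, so surrogate pairs stay together.
        int[] codePoints = emojis.codePoints().sorted().toArray();
        System.out.println("Sorted String: " + new String(codePoints, 0, codePoints.length));
    }
}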

Initial String: 😓😭😁

Sorted String: 😁😓😭

I switched the order of the characters in your example because they were already sorted.

answered Nov 01 '22 at 19:11 by Jacob G.


If you are using Java 8 or later, then this is a simple way to sort the characters in a string while respecting (not breaking) multi-char codepoints:

int[] codepoints = someString.codePoints().sorted().toArray();
String sorted = new String(codepoints, 0, codepoints.length);

Prior to Java 8, I think you either need to use a loop to iterate the code points in the original string, or use a 3rd-party library method.
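The loop-based approach could look roughly like the sketch below. The class name CodePointSort and the method sortByCodePoint are hypothetical, chosen only for this example; the String and Character methods it calls have been available since Java 5.

```java
import java.util.Arrays;

public class CodePointSort {
    // Sorts the characters of a string by code point without using streams.
    public static String sortByCodePoint(String s) {
        int[] codePoints = new int[s.codePointCount(0, s.length())];
        int n = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            codePoints[n++] = cp;
            // Advance by 1 char for BMP characters, 2 for surrogate pairs.
            i += Character.charCount(cp);
        }
        Arrays.sort(codePoints);
        return new String(codePoints, 0, codePoints.length);
    }

    public static void main(String[] args) {
        String emojis = new String(new int[] {128531, 128557, 128513}, 0, 3);
        System.out.println("Sorted String: " + sortByCodePoint(emojis));
    }
}
```

Stepping the index by Character.charCount(cp) is what keeps each surrogate pair together while iterating.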


Fortunately, sorting the codepoints in a String is uncommon enough that the clunkiness and relative inefficiency of the solutions above are rarely a concern.

(When was the last time you tested for anagrams of emojis?)

answered Nov 01 '22 at 20:11 by Stephen C