
Unicode supplementary planes in JavaFX

I'm having problems dealing with Unicode characters from supplementary ("astral") planes in JavaFX. Specifically, I can't paste such characters into a TextInputDialog (I get some weird characters instead, such as ð), and I can't use them in a WebView (they get rendered as ������).

The same characters work perfectly fine if I input them via JOptionPane.showInputDialog and print them to the console. They even show up in a JavaFX Alert, although it appends some junk at the end.

Is there a way to fix these problems?

I'm using Oracle JDK version 1.8.0_51 in Linux.
Examples of supplementary plane characters: 😀 𐂃 🂡 🙭 𫞂
If you can't see them, you may need to install additional fonts such as Symbola or Noto.

Here's an example program (using a Label rather than a WebView):

import javax.swing.JOptionPane;

import javafx.application.Application;
import javafx.scene.Scene;
import javafx.scene.control.Alert;
import javafx.scene.control.Alert.AlertType;
import javafx.scene.control.Label;
import javafx.scene.control.TextInputDialog;
import javafx.scene.layout.StackPane;
import javafx.stage.Stage;

public class UniTest extends Application {
    @Override
    public void start(final Stage stage) throws Exception {
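        // Code points U+1F0A1, U+2B782, U+0CA0, U+1F600, U+00F1; the first, second and fourth are supplementary.
        final String s = new String(new int[]{127137, 178050, 3232, 128512, 241}, 0, 5);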
        System.out.println("The string: " + s);
        System.out.println("Characters: " + s.length());
        System.out.println("Code points: " + s.codePoints().count());

        JOptionPane.showMessageDialog(null, s, "JOptionPane", JOptionPane.INFORMATION_MESSAGE);

        final Alert al = new Alert(AlertType.INFORMATION);
        al.setTitle("Alert");
        al.setContentText(s);
        al.showAndWait();

        final TextInputDialog dlg = new TextInputDialog();
        dlg.setTitle("TextInputDialog");
        dlg.setContentText("Try to paste the string in here");
        dlg.showAndWait().ifPresent(x -> System.out.println("Your input: " + x));

        final StackPane root = new StackPane();
        root.getChildren().add(new Label(s));
        stage.setScene(new Scene(root, 400, 300));
        stage.setTitle("Stage");
        stage.show();
    }

    public static void main(final String... args) {
        launch(args);
    }
}

And here are the results I get:

screenshots

Note: not all the characters in the example are from supplementary planes, and one of the characters is only rendered correctly in the console.

aditsu quit because SE is EVIL asked Oct 19 '22 00:10



1 Answer

TL;DR: Evidently JavaFX is buggy.

Here is the text you are using.

🂡𫞂ಠ😀ñ

Decimal codepoint representation:

127137 178050 3232 128512 241

Hex representation:

0x1F0A1 0x2B782 0xCA0 0x1F600 0xF1

Display Bug

Java uses UTF-16 internally, so consider the UTF-16 representation:

D83C DCA1 D86D DF82 0CA0 D83D DE00 00F1
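The code units above can be reproduced with a few lines of plain Java. This is only a sketch for checking the table; the class and method names (`Utf16Dump`, `utf16Hex`) are made up for illustration and have nothing to do with JavaFX itself.

```java
final class Utf16Dump {
    // Builds a string from the given code points and formats each
    // UTF-16 code unit (Java char) as four upper-case hex digits.
    static String utf16Hex(int... codePoints) {
        String s = new String(codePoints, 0, codePoints.length);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Same five code points as in the question.
        System.out.println(utf16Hex(0x1F0A1, 0x2B782, 0xCA0, 0x1F600, 0xF1));
        // prints: D83C DCA1 D86D DF82 0CA0 D83D DE00 00F1
    }
}
```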

We can see that the display shows the five characters you expect, followed by three garbage characters.

So it is clearly trying to display 8 glyphs where there are only five. This is almost certainly because the display code counts 8 characters: three of the five characters are encoded in UTF-16 as surrogate pairs, which take two 16-bit code units each, so the string has 5 + 3 = 8 code units. In other words, it is using the wrong value for the length of the string in the presence of surrogate pairs.
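The 8-versus-5 discrepancy is easy to see in plain Java. The following is a minimal sketch of the two ways of counting; it does not show what JavaFX does internally, and the class name `LengthDemo` and its helpers are hypothetical.

```java
final class LengthDemo {
    // The test string from the question, built from its code points.
    static String sample() {
        return new String(new int[]{0x1F0A1, 0x2B782, 0xCA0, 0x1F600, 0xF1}, 0, 5);
    }

    // Number of UTF-16 code units, which is what String.length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points (a surrogate pair counts once).
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Number of code points that need a surrogate pair in UTF-16.
    static long supplementary(String s) {
        return s.codePoints().filter(Character::isSupplementaryCodePoint).count();
    }

    public static void main(String[] args) {
        String s = sample();
        System.out.println("code units:    " + codeUnits(s));     // 8
        System.out.println("code points:   " + codePoints(s));    // 5
        System.out.println("supplementary: " + supplementary(s)); // 3
    }
}
```

Code that lays out one glyph per `char` would see 8 items here, which matches the three extra glyphs in the screenshot.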

Pasted Text Bug

UTF-8 Representation of test data:

F0 9F 82 A1 F0 AB 9E 82 E0 B2 A0 F0 9F 98 80 C3 B1

What is displayed instead:

00F0 ð LATIN SMALL LETTER ETH 
009F  <control> = APC = APPLICATION PROGRAM COMMAND 
0082  <control> = BPH = BREAK PERMITTED HERE
00A1 ¡ INVERTED EXCLAMATION MARK 
00F0 ð LATIN SMALL LETTER ETH 

(The two control characters can have glyphs in some fonts containing either their abbreviations or hex codes. These are visible in your example.)

Latin1 hex representation:

F0 9F 82 A1 F0

Note that these five bytes are the same as the first five bytes of the UTF-8 representation of the intended text.

Conclusion: The data was pasted as 5 code points encoded in 17 bytes of UTF-8, but it was interpreted as 5 Latin1 characters occupying 5 bytes, so only the first 5 bytes came through. Again, the wrong property has been used for the length.
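The same garbage can be reproduced outside JavaFX by encoding the string as UTF-8, decoding the bytes as ISO-8859-1 (Latin1), and cutting the result off at the code point count. This is a sketch of the hypothesis above, not the actual JavaFX clipboard code; `MojibakeDemo` and its methods are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {
    // Encodes as UTF-8, then (wrongly) decodes every byte as one Latin1 character.
    static String misdecode(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Applies the suspected length mix-up: keep only as many Latin1
    // characters as the original string has code points.
    static String truncatedPaste(String s) {
        int codePoints = s.codePointCount(0, s.length());
        return misdecode(s).substring(0, codePoints);
    }

    public static void main(String[] args) {
        String s = new String(new int[]{0x1F0A1, 0x2B782, 0xCA0, 0x1F600, 0xF1}, 0, 5);
        System.out.println("UTF-8 bytes: " + s.getBytes(StandardCharsets.UTF_8).length); // 17
        StringBuilder hex = new StringBuilder();
        for (char c : truncatedPaste(s).toCharArray()) {
            hex.append(String.format("%04X ", (int) c));
        }
        System.out.println(hex.toString().trim()); // 00F0 009F 0082 00A1 00F0
    }
}
```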

Ben answered Oct 28 '22 17:10
