
Why does my Unicode string get corrupted when passed from a Java applet to JavaScript?

I'm pretty new, so don't be too harsh :)

Question (tl;dr)

I'm facing a problem passing a Unicode string from a javax.swing.JApplet embedded in a web page to the JavaScript part. I'm not sure whether this is a bug or a misunderstanding of the involved technologies:

Problem

I want to pass a Unicode string from a Java applet to JavaScript, but the string gets messed up. Strangely, the problem doesn't occur in Internet Explorer 10, but it does in Chrome (v26) and Firefox (v20). I haven't tested other browsers.

The returned string seems to be okay, except for the last Unicode character. The results in the JavaScript debugger and on the web page are:

  • abc → abc
  • 表示 → 表��
  • ま → ま
  • ウォッチリスト → ウォッチリス��
  • アップロード → アップロー��
  • ホ → ��
  • ホ → ホ (Not deterministic)
  • アップロードabc → アップロードabc

The string seems to get corrupted in its last bytes. If it ends with an ASCII character, the string is okay. Additionally, the problem doesn't occur with every combination, and possibly not every time (I'm not sure about this). Therefore I suspect a bug, and I'm afraid I might be posting an invalid question.

Test Set Up

A minimal setup includes an applet that returns some Unicode (UTF-8) strings:

/* TestApplet.java */
import javax.swing.*;

public class TestApplet extends JApplet {

    private String[] testStrings = {
            "abc", // OK (because ASCII only)
            "表示", // Error on last character
            "表示", // Error on last character
            "ホーム ", // OK (because of *space* after ム)
            "アップロード", ... };

    public TestApplet() {...}     // Applet-specific stuff

    ...

    public int getLength() { return testStrings.length; }

    // Accessor instead of built-in array functionality, because of IE.
    // Must be public so that JavaScript can call it.
    public String getTestString(int i) {
        return testStrings[i];
    }
}

The corresponding web page with JavaScript could look like this:

 /* test.html */
<!DOCTYPE html>
<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    </head>
    <body>
        <span id="1"></span>
        <applet id="output" archive="test.jar" code="TestApplet"></applet>
    </body>

    <script type="text/javascript" charset="utf-8">
        var applet = document.getElementById('output');
        var node = document.getElementById("1");
        for(var i = 0; i < applet.getLength(); i++) {
             var text = applet.getTestString(i);
             var paragraphNode = document.createElement("p");
             paragraphNode.innerHTML = text;
             node.appendChild(paragraphNode);
        }
    </script>
</html>

Environment

I'm working on Windows 7 32-Bit with the current Java Version 1.7.0_21 using the "Next Generation Java Plug-in 10.21.2 for Mozilla browsers". I had some problems with my operating system locale, but I tried several (English, Japanese, Chinese) regional settings.

In the case of a corrupt string, Chrome shows invalid characters (e.g. ��). Firefox, on the other hand, drops the string completely if it would end with ��.

Internet Explorer manages to display the strings correctly.

Solutions?

I can imagine several workarounds, including escaping/unescaping and adding a "final char" which is then removed via JavaScript. Actually I'm planning to write against Android's WebKit, and I haven't tested it there.

Since I would like to continue testing in Chrome (because of the WebKit technology and for comfort), I hope there is a trivial solution to the problem that I might have overlooked.
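
The two workarounds mentioned above could be sketched on the JavaScript side roughly as follows. This is only a sketch under assumptions: decodeFromApplet and stripSentinel are hypothetical helper names, the applet is assumed to return either a percent-encoded string (for example produced with java.net.URLEncoder.encode(s, "UTF-8")) or the original string with one extra ASCII character appended, and it has not been verified that either variant actually avoids the corruption.

```javascript
// Hypothetical helper: decodes a string that the applet is ASSUMED to have
// percent-encoded, e.g. with java.net.URLEncoder.encode(s, "UTF-8").
function decodeFromApplet(encoded) {
    // URLEncoder writes spaces as "+", whereas decodeURIComponent expects "%20".
    // A literal "+" arrives as "%2B", so this replacement is safe.
    return decodeURIComponent(String(encoded).replace(/\+/g, "%20"));
}

// Hypothetical helper: removes one trailing sentinel character that the
// applet is ASSUMED to have appended (e.g. testStrings[i] + "|").
function stripSentinel(text, sentinel) {
    var s = String(text);
    if (s.length > 0 && s.charAt(s.length - 1) === sentinel) {
        return s.substring(0, s.length - 1);
    }
    return s; // sentinel missing: return the string unchanged
}
```

In the loop of test.html, the direct call would then become something like var text = stripSentinel(applet.getTestString(i), "|"); the sentinel approach relies on the observation that strings ending in an ASCII character arrive intact.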

Inuniku asked May 03 '13 13:05


2 Answers

If you are testing in Chrome/Firefox, please replace the first line with this and test again:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">

The doctype has a significant influence on how the browser interprets the page.

Transitional/loose is the type you can use with Unicode. Please test and reply.

MarmiK answered Oct 04 '22 00:10


I suggest setting a breakpoint on

paragraphNode.innerHTML = text;

and inspecting text in the JavaScript console, e.g. with

console.log(escape(text));

or

console.log(encodeURIComponent(text));

or

for (var i = 0; i < text.length; i++) {
    console.log("i = "+i);
    console.log("text.charAt(i) = "+text.charAt(i)
    +", text.charCodeAt(i) = "+text.charCodeAt(i));
}
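
The same inspection can be packed into two small helpers, which makes it easier to compare the output across browsers. This is a sketch only: describeCodeUnits and endsWithReplacementChar are hypothetical names, and the second helper assumes the corrupted tail shows up as U+FFFD (the replacement character), which may not be what every browser delivers.

```javascript
// Hypothetical helper: lists every UTF-16 code unit of a string as hex,
// e.g. to paste the result of applet.getTestString(i) into a bug report.
function describeCodeUnits(text) {
    var parts = [];
    for (var i = 0; i < text.length; i++) {
        var hex = text.charCodeAt(i).toString(16).toUpperCase();
        while (hex.length < 4) {
            hex = "0" + hex; // pad to four hex digits
        }
        parts.push("0x" + hex);
    }
    return parts.join(" ");
}

// Hypothetical helper: true if the last code unit is U+FFFD.
// ASSUMPTION: the corruption is reported as the replacement character.
function endsWithReplacementChar(text) {
    return text.length > 0 && text.charCodeAt(text.length - 1) === 0xFFFD;
}
```

For an intact "ホ" the first helper returns "0x30DB", matching the code point on the fileformat.info page for that character.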

See also

http://www.fileformat.info/info/unicode/char/30a6/index.htm

https://developer.mozilla.org/en-US/docs/DOM/window.escape (which is not part of any standard)

and

https://developer.mozilla.org/en-US/docs/JavaScript/Reference/Global_Objects/encodeURIComponent

or similar resources.

Your source files may not be in the encoding you assume (UTF-8).

JavaScript assumes UTF-16 strings:

http://www.ecma-international.org/ecma-262/5.1/#sec-4.3.16

Java also assumes UTF-16:

http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/String.html

The Linux or Cygwin file command can show you the encoding of your files.

See

http://linux.die.net/man/1/file (haven't found a kernel.org man reference)
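
If the file command is not at hand, a rough check whether a file's bytes are at least valid UTF-8 can be scripted. The following is a sketch that assumes an environment with the standard TextDecoder API (for example a recent Node.js); isValidUtf8 is a hypothetical helper name. Note that passing this check does not prove the file was meant as UTF-8, and the Java compiler still has to be told the encoding (javac -encoding UTF-8) if it differs from the platform default.

```javascript
// Hypothetical helper: returns true if the given bytes decode as UTF-8
// without errors. With { fatal: true } the decoder throws on invalid or
// truncated byte sequences instead of inserting U+FFFD.
function isValidUtf8(bytes) {
    try {
        new TextDecoder("utf-8", { fatal: true }).decode(bytes);
        return true;
    } catch (e) {
        return false;
    }
}

// Possible usage in Node.js (file name taken from the question):
//   var fs = require("fs");
//   console.log(isValidUtf8(fs.readFileSync("TestApplet.java")));
```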

stackunderflow answered Oct 03 '22 22:10