Background
I am trying to get clipboard data in the HTML data flavor using Java. To do so, I copy content to the clipboard from a browser and then read it using java.awt.datatransfer.Clipboard.
This works properly on Windows systems. But on Ubuntu there are some strange issues. The worst case is when the data was copied to the clipboard from the Firefox browser.
Example for reproducing the behavior
Java code:
import java.io.*;
import java.awt.Toolkit;
import java.awt.datatransfer.Clipboard;
import java.awt.datatransfer.DataFlavor;

public class WorkingWithClipboadData {

    // Writes the raw bytes of one clipboard flavor to a file named after its charset.
    static void doSomethingWithBytesFromClipboard(byte[] dataBytes, String paramCharset, int number) throws Exception {
        String fileName = "Result " + number + " " + paramCharset + ".txt";
        OutputStream fileOut = new FileOutputStream(fileName);
        fileOut.write(dataBytes, 0, dataBytes.length);
        fileOut.close();
    }

    public static void main(String[] args) throws Exception {
        Clipboard clipboard = Toolkit.getDefaultToolkit().getSystemClipboard();
        int count = 0;
        for (DataFlavor dataFlavor : clipboard.getAvailableDataFlavors()) {
            System.out.println(dataFlavor);
            String mimeType = dataFlavor.getHumanPresentableName();
            if ("text/html".equalsIgnoreCase(mimeType)) {
                String paramClass = dataFlavor.getParameter("class");
                if ("java.io.InputStream".equals(paramClass)) {
                    String paramCharset = dataFlavor.getParameter("charset");
                    if (paramCharset != null && paramCharset.startsWith("UTF")) {
                        System.out.println("============================================");
                        System.out.println(paramCharset);
                        System.out.println("============================================");
                        // Read the complete stream of this flavor into a byte array.
                        InputStream inputStream = (InputStream)clipboard.getData(dataFlavor);
                        ByteArrayOutputStream data = new ByteArrayOutputStream();
                        byte[] buffer = new byte[1024];
                        int length = -1;
                        while ((length = inputStream.read(buffer)) != -1) {
                            data.write(buffer, 0, length);
                        }
                        data.flush();
                        inputStream.close();
                        byte[] dataBytes = data.toByteArray();
                        data.close();
                        doSomethingWithBytesFromClipboard(dataBytes, paramCharset, ++count);
                    }
                }
            }
        }
    }
}
Problem description
I open the URL https://en.wikipedia.org/wiki/Germanic_umlaut in Firefox, select "letters: ä" there, and copy this to the clipboard. Then I run my Java program. The resulting files (only some of them are shown as examples) look like this:
axel@arichter:~/Dokumente/JAVA/poi/poi-3.17$ xxd "./Result 1 UTF-16.txt"
00000000: feff fffd fffd 006c 0000 0065 0000 0074 .......l...e...t
00000010: 0000 0074 0000 0065 0000 0072 0000 0073 ...t...e...r...s
00000020: 0000 003a 0000 0020 0000 003c 0000 0069 ...:... ...<...i
00000030: 0000 003e 0000 fffd 0000 003c 0000 002f ...>.......<.../
00000040: 0000 0069 0000 003e 0000 ...i...>..
OK, the FEFF at the start looks like a UTF-16BE byte order mark. But what is the FFFD? And why are there those 0000 bytes between the single letters? The UTF-16 encoding of l is just 006C. It seems as if all letters are encoded in 32 bits, which is wrong for UTF-16. And all non-ASCII characters are encoded as FFFD 0000 and so are lost.
axel@arichter:~/Dokumente/JAVA/poi/poi-3.17$ xxd "./Result 4 UTF-8.txt"
00000000: efbf bdef bfbd 6c00 6500 7400 7400 6500 ......l.e.t.t.e.
00000010: 7200 7300 3a00 2000 3c00 6900 3e00 efbf r.s.:. .<.i.>...
00000020: bd00 3c00 2f00 6900 3e00 ..<./.i.>.
Here the EFBF BDEF BFBD does not look like any known byte order mark. All letters seem to be encoded in 16 bits, which is double the bits needed in UTF-8. So the number of bits used always seems to be double what is needed, as in the UTF-16 example above. And all non-ASCII letters are encoded as EFBFBD and so are also lost.
axel@arichter:~/Dokumente/JAVA/poi/poi-3.17$ xxd "./Result 7 UTF-16BE.txt"
00000000: fffd fffd 006c 0000 0065 0000 0074 0000 .....l...e...t..
00000010: 0074 0000 0065 0000 0072 0000 0073 0000 .t...e...r...s..
00000020: 003a 0000 0020 0000 003c 0000 0069 0000 .:... ...<...i..
00000030: 003e 0000 fffd 0000 003c 0000 002f 0000 .>.......<.../..
00000040: 0069 0000 003e 0000 .i...>..
Same picture as in the examples above. All letters are encoded using 32 bits, although UTF-16 should use only 16 bits, except for supplementary characters, which use surrogate pairs. And all non-ASCII letters are encoded as FFFD 0000 and so are lost.
axel@arichter:~/Dokumente/JAVA/poi/poi-3.17$ xxd "./Result 10 UTF-16LE.txt"
00000000: fdff fdff 6c00 0000 6500 0000 7400 0000 ....l...e...t...
00000010: 7400 0000 6500 0000 7200 0000 7300 0000 t...e...r...s...
00000020: 3a00 0000 2000 0000 3c00 0000 6900 0000 :... ...<...i...
00000030: 3e00 0000 fdff 0000 3c00 0000 2f00 0000 >.......<.../...
00000040: 6900 0000 3e00 0000 i...>...
Only for completeness: same picture as above.
So the conclusion is that the Ubuntu clipboard content is totally messed up after copying something into it from Firefox, at least for the HTML data flavors and when reading the clipboard using Java.
Other browser used
When I do the same thing using the Chromium browser as the source of the data, the problems become smaller.
So I open the URL https://en.wikipedia.org/wiki/Germanic_umlaut in Chromium, select "letters: ä" there, and copy this to the clipboard. Then I run my Java program.
The result looks like this:
axel@arichter:~/Dokumente/JAVA/poi/poi-3.17$ xxd "./Result 1 UTF-16.txt"
00000000: feff 003c 006d 0065 0074 0061 0020 0068 ...<.m.e.t.a. .h
...
00000800: 0061 006c 003b 0022 003e 00e4 003c 002f .a.l.;.".>...<./
00000810: 0069 003e 0000 .i.>..
Chromium puts more HTML around the selection into the HTML data flavors in the clipboard. But the encoding looks correct, also for the non-ASCII ä = 00E4. There is still a small problem: there are additional 00 bytes at the end which should not be there. In UTF-16 there are 2 additional 00 bytes at the end.
axel@arichter:~/Dokumente/JAVA/poi/poi-3.17$ xxd "./Result 4 UTF-8.txt"
00000000: 3c6d 6574 6120 6874 7470 2d65 7175 6976 <meta http-equiv
...
000003f0: 696f 6e2d 636f 6c6f 723a 2069 6e69 7469 ion-color: initi
00000400: 616c 3b22 3ec3 a43c 2f69 3e00 al;">..</i>.
Same as above. The encoding looks correct for UTF-8. But here, too, there is one additional 00 byte at the end which should not be there.
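As a possible workaround for the trailing 00 bytes seen with Chromium (and with Writer), the bytes read from the stream could be trimmed before further use. The following is only a minimal sketch; the class name TrailingNulStripper and its method are hypothetical helpers, not part of any Java API, and it assumes that the extra bytes are always whole zero code units at the very end (1 byte for UTF-8, 2 bytes for UTF-16).

```java
import java.util.Arrays;

final class TrailingNulStripper {

    // Removes trailing all-zero code units from the end of the data.
    // unitSize is assumed to be 1 for UTF-8 and 2 for UTF-16 variants.
    static byte[] stripTrailingNul(byte[] data, int unitSize) {
        int end = data.length;
        while (end >= unitSize && isAllZero(data, end - unitSize, end)) {
            end -= unitSize;
        }
        return Arrays.copyOf(data, end);
    }

    // Returns true if every byte in data[from, to) is zero.
    private static boolean isAllZero(byte[] data, int from, int to) {
        for (int i = from; i < to; i++) {
            if (data[i] != 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Tail of the UTF-8 result from Chromium: "</i>" followed by one 00 byte.
        byte[] utf8Tail = {0x3c, 0x2f, 0x69, 0x3e, 0x00};
        // Tail of the UTF-16 (big endian) result: "i>" followed by 0000.
        byte[] utf16Tail = {0x00, 0x69, 0x00, 0x3e, 0x00, 0x00};
        System.out.println(stripTrailingNul(utf8Tail, 1).length);
        System.out.println(stripTrailingNul(utf16Tail, 2).length);
    }
}
```

In the program above this could be applied to dataBytes before writing the file, with the unit size chosen from paramCharset. It does nothing for the Firefox case, where the content itself is already damaged.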
Environment
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.4 LTS"
Mozilla Firefox 61.0.1 (64-Bit)
java version "1.8.0_101"
Java(TM) SE Runtime Environment (build 1.8.0_101-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.101-b13, mixed mode)
Questions
Am I doing something wrong in my code?
Can someone advise how to avoid the messed up content in the clipboard? Since the non-ASCII characters are lost, at least when copied from Firefox, I don't think that we can repair this content.
Is this a known issue somehow? Can someone confirm the same behavior? If so, is there already a bug report in Firefox about this?
Or is this a problem which only occurs if Java code reads the clipboard content? It seems so, because if I copy content from Firefox and paste it into LibreOffice Writer, then Unicode appears properly. And if I then copy content from Writer to the clipboard and read it using my Java program, then the UTF encodings are correct except for the additional 00 bytes at the end. So clipboard content copied from Writer behaves like content copied from the Chromium browser.
New insights
The bytes 0xFFFD seem to be the Unicode character 'REPLACEMENT CHARACTER' (U+FFFD). So 0xFDFF is the little endian representation of it and 0xEFBFBD is its UTF-8 encoding. So all results seem to be the result of wrongly decoding and re-encoding Unicode.
It seems as if the clipboard content coming from Firefox is always UTF-16LE with BOM. But then Java reads it as UTF-8. So the 2 byte BOM becomes two invalid characters, each of which is replaced with 0xEFBFBD, each additional 0x00 byte becomes its own NUL character, and all byte sequences which are not proper UTF-8 byte sequences become invalid characters, which are replaced with 0xEFBFBD. Then this pseudo UTF-8 is re-encoded. Now the garbage is complete.
Example:
The sequence aɛaüa in UTF-16LE with BOM will be
0xFFFE 6100 5B02 6100 FC00 6100.
This taken as UTF-8 (0xEFBFBD = not a proper UTF-8 byte sequence) =
0xEFBFBD 0xEFBFBD a NUL [ STX a NUL 0xEFBFBD NUL a NUL.
This pseudo ASCII re-encoded to UTF-16LE will be:
0xFDFF FDFF 6100 0000 5B00 0200 6100 0000 FDFF 0000 6100 0000
This pseudo ASCII re-encoded to UTF-8 will be:
0xEFBF BDEF BFBD 6100 5B02 6100 EFBF BD00 6100
And this is exactly what happens.
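The example above can be replayed in plain Java without touching the clipboard. The following is a minimal sketch; the class MisdecodeDemo and its methods are hypothetical names chosen for illustration. It relies on the documented behavior of new String(byte[], Charset), which replaces malformed input with U+FFFD.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MisdecodeDemo {

    // Decodes the raw bytes with the wrong charset (UTF-8), then re-encodes
    // the resulting string in the target charset.
    static byte[] misdecodeAndReencode(byte[] raw, Charset target) {
        String wronglyDecoded = new String(raw, StandardCharsets.UTF_8);
        return wronglyDecoded.getBytes(target);
    }

    // Formats bytes as an upper case hex string without separators.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "aɛaüa" in UTF-16LE with BOM, as given in the example above.
        byte[] raw = {
            (byte) 0xFF, (byte) 0xFE,
            0x61, 0x00, 0x5B, 0x02, 0x61, 0x00, (byte) 0xFC, 0x00, 0x61, 0x00
        };
        System.out.println(hex(misdecodeAndReencode(raw, StandardCharsets.UTF_16LE)));
        System.out.println(hex(misdecodeAndReencode(raw, StandardCharsets.UTF_8)));
    }
}
```

The two printed lines should correspond to the UTF-16LE and UTF-8 byte sequences listed in the example, which supports the assumption that a UTF-8 decoding step is applied to UTF-16LE data somewhere on the way.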
Other examples:
Â = 0x00C2 = C200 in UTF-16LE = 0xEFBFBD00 in pseudo UTF-8
胂 = 0x80C2 = C280 in UTF-16LE = 0xC280 in pseudo UTF-8
So I think Firefox is not to blame for this, but either Ubuntu or Java's runtime environment. And because copy/paste from Firefox to Writer works in Ubuntu, I think Java's runtime environment does not handle the Firefox data flavors in the Ubuntu clipboard correctly.
New insights:
I have compared the flavormap.properties files of my Windows 10 and my Ubuntu systems, and there is a difference. In Ubuntu the native name for text/html is UTF8_STRING, while in Windows it is HTML Format. So I thought that this might be the problem, and I added the line
HTML\ Format=text/html;charset=utf-8;eoln="\n";terminators=0
to my flavormap.properties file in Ubuntu.
After that:
Map<DataFlavor,String> nativesForFlavors = SystemFlavorMap.getDefaultFlavorMap().getNativesForFlavors(
new DataFlavor[]{
new DataFlavor("text/html;charset=UTF-16LE")
});
System.out.println(nativesForFlavors);
prints
{java.awt.datatransfer.DataFlavor[mimetype=text/html;representationclass=java.io.InputStream;charset=UTF-16LE]=HTML Format}
But this does not change the results when the Ubuntu clipboard content is read by Java.
After looking at this quite a bit, it looks like this is a longstanding bug with Java (even older report here).
It looks like, on X11, the Java components expect clipboard data to always be UTF-8 encoded, while Firefox encodes the data as UTF-16. Because of this assumption, Java mangles the text by forcing UTF-16 to be parsed as UTF-8. I tried but couldn't find a good way to bypass the issue. The "text" part of "text/html" seems to indicate to Java that the bytes received from the clipboard should always be interpreted as text first and then offered in the various flavors. I couldn't find any straightforward way to access the pre-converted byte array from X11.
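Since the original bytes cannot be recovered, about the only thing a program can do is notice the damage and warn the user. The following is a rough heuristic sketch, not a fix; the class MangledClipboardCheck and its method are hypothetical names. It assumes the pattern seen in the dumps in the question: the mangled UTF-16LE BOM shows up as two U+FFFD characters at the start of the decoded text, and NUL characters appear in the content.

```java
import java.nio.charset.StandardCharsets;

final class MangledClipboardCheck {

    // Heuristic: text decoded from the clipboard flavor starts with two
    // replacement characters (the mangled BOM FF FE) and contains NULs.
    static boolean looksMangled(String decoded) {
        return decoded.startsWith("\uFFFD\uFFFD") && decoded.indexOf('\u0000') >= 0;
    }

    public static void main(String[] args) {
        // Start of "Result 4 UTF-8.txt" when copied from Firefox.
        byte[] fromFirefox = {
            (byte) 0xEF, (byte) 0xBF, (byte) 0xBD, (byte) 0xEF, (byte) 0xBF, (byte) 0xBD,
            0x6C, 0x00, 0x65, 0x00
        };
        // Start of "Result 4 UTF-8.txt" when copied from Chromium.
        byte[] fromChromium = "<meta http-equiv".getBytes(StandardCharsets.UTF_8);

        System.out.println(looksMangled(new String(fromFirefox, StandardCharsets.UTF_8)));
        System.out.println(looksMangled(new String(fromChromium, StandardCharsets.UTF_8)));
    }
}
```

A check like this can produce false positives for content that legitimately starts with replacement characters, so it is only suitable for deciding whether to warn or fall back to another flavor such as plain text.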