 

Clipboard content is messed up when copied from Firefox and read using Java in Ubuntu

Background

I am trying to get clipboard data in the HTML data flavor using Java. To do so, I copy content into the clipboard from a browser and then read it with java.awt.datatransfer.Clipboard.

This works properly on Windows. On Ubuntu, however, there are some strange issues. The worst case is when the data was copied into the clipboard from Firefox.

Example for reproducing the behavior

Java code:

import java.io.*;

import java.awt.Toolkit;
import java.awt.datatransfer.Clipboard;
import java.awt.datatransfer.DataFlavor;

public class WorkingWithClipboadData {

 // Dumps the raw bytes of one flavor to a file such as "Result 1 UTF-16.txt".
 static void doSomethingWithBytesFromClipboard(byte[] dataBytes, String paramCharset, int number) throws Exception {

  String fileName = "Result " + number + " " + paramCharset + ".txt";

  try (OutputStream fileOut = new FileOutputStream(fileName)) {
   fileOut.write(dataBytes, 0, dataBytes.length);
  }

 }

 public static void main(String[] args) throws Exception {

  Clipboard clipboard = Toolkit.getDefaultToolkit().getSystemClipboard();

  int count = 0;

  for (DataFlavor dataFlavor : clipboard.getAvailableDataFlavors()) {

   System.out.println(dataFlavor);

   // Only look at text/html flavors that are offered as an InputStream in a UTF charset.
   String mimeType = dataFlavor.getHumanPresentableName();
   if ("text/html".equalsIgnoreCase(mimeType)) {
    String paramClass = dataFlavor.getParameter("class");
    if ("java.io.InputStream".equals(paramClass)) {
     String paramCharset = dataFlavor.getParameter("charset");
     if (paramCharset != null && paramCharset.startsWith("UTF")) {

      System.out.println("============================================");
      System.out.println(paramCharset);
      System.out.println("============================================");

      // Read the complete stream into a byte array.
      ByteArrayOutputStream data = new ByteArrayOutputStream();
      try (InputStream inputStream = (InputStream) clipboard.getData(dataFlavor)) {
       byte[] buffer = new byte[1024];
       int length;
       while ((length = inputStream.read(buffer)) != -1) {
        data.write(buffer, 0, length);
       }
      }

      byte[] dataBytes = data.toByteArray();

      doSomethingWithBytesFromClipboard(dataBytes, paramCharset, ++count);

     }
    }
   }
  }
 }

}

Problem description

I open the URL https://en.wikipedia.org/wiki/Germanic_umlaut in Firefox, select "letters: ä" there, and copy it into the clipboard. Then I run my Java program. Afterwards the resulting files (only some of them are shown as examples) look like this:

axel@arichter:~/Dokumente/JAVA/poi/poi-3.17$ xxd "./Result 1 UTF-16.txt" 
00000000: feff fffd fffd 006c 0000 0065 0000 0074  .......l...e...t
00000010: 0000 0074 0000 0065 0000 0072 0000 0073  ...t...e...r...s
00000020: 0000 003a 0000 0020 0000 003c 0000 0069  ...:... ...<...i
00000030: 0000 003e 0000 fffd 0000 003c 0000 002f  ...>.......<.../
00000040: 0000 0069 0000 003e 0000                 ...i...>..

OK, the FEFF at the start looks like a UTF-16BE byte order mark. But what is the FFFD? And why are there 0000 bytes between the single letters? The UTF-16 encoding of l is just 006C. It seems as if all letters are encoded in 32 bits, which is wrong for UTF-16. And all non-ASCII characters are encoded as FFFD 0000 and so are lost.

axel@arichter:~/Dokumente/JAVA/poi/poi-3.17$ xxd "./Result 4 UTF-8.txt" 
00000000: efbf bdef bfbd 6c00 6500 7400 7400 6500  ......l.e.t.t.e.
00000010: 7200 7300 3a00 2000 3c00 6900 3e00 efbf  r.s.:. .<.i.>...
00000020: bd00 3c00 2f00 6900 3e00                 ..<./.i.>.

Here the EFBF BDEF BFBD does not look like any known byte order mark. All letters seem to be encoded in 16 bits, which is double what UTF-8 needs. So the number of bits used always seems to be double what is needed, as in the UTF-16 example above. And all non-ASCII letters are encoded as EFBFBD and so are lost as well.

axel@arichter:~/Dokumente/JAVA/poi/poi-3.17$ xxd "./Result 7 UTF-16BE.txt" 
00000000: fffd fffd 006c 0000 0065 0000 0074 0000  .....l...e...t..
00000010: 0074 0000 0065 0000 0072 0000 0073 0000  .t...e...r...s..
00000020: 003a 0000 0020 0000 003c 0000 0069 0000  .:... ...<...i..
00000030: 003e 0000 fffd 0000 003c 0000 002f 0000  .>.......<.../..
00000040: 0069 0000 003e 0000                      .i...>..

Same picture as in the examples above. All letters are encoded using 32 bits, although UTF-16 should use only 16 bits except for supplementary characters, which use surrogate pairs. And all non-ASCII letters are encoded as FFFD 0000 and so are lost.

axel@arichter:~/Dokumente/JAVA/poi/poi-3.17$ xxd "./Result 10 UTF-16LE.txt" 
00000000: fdff fdff 6c00 0000 6500 0000 7400 0000  ....l...e...t...
00000010: 7400 0000 6500 0000 7200 0000 7300 0000  t...e...r...s...
00000020: 3a00 0000 2000 0000 3c00 0000 6900 0000  :... ...<...i...
00000030: 3e00 0000 fdff 0000 3c00 0000 2f00 0000  >.......<.../...
00000040: 6900 0000 3e00 0000                      i...>...

Just for completeness: same picture as above.

So the conclusion is that the Ubuntu clipboard content is completely garbled after copying something into it from Firefox, at least for HTML data flavors and when reading the clipboard using Java.
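The common signature in all four dumps is that the decoded text starts with two U+FFFD characters. A minimal sketch of a check for that signature follows; the class and method names are hypothetical, and the heuristic assumes that legitimate HTML never starts with two replacement characters.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MangledClipboardCheck {

    // Heuristic (assumption): content that decodes to a string starting with
    // two U+FFFD characters shows the pattern seen in the Firefox dumps above.
    static boolean looksMangled(byte[] data, Charset charset) {
        String text = new String(data, charset);
        return text.startsWith("\uFFFD\uFFFD");
    }

    public static void main(String[] args) {
        // First bytes of "Result 7 UTF-16BE.txt" from the dump above.
        byte[] firefoxSample = {
            (byte) 0xFF, (byte) 0xFD, (byte) 0xFF, (byte) 0xFD, 0x00, 0x6C, 0x00, 0x00
        };
        byte[] cleanSample = "<i>".getBytes(StandardCharsets.UTF_16BE);

        System.out.println(looksMangled(firefoxSample, StandardCharsets.UTF_16BE));
        System.out.println(looksMangled(cleanSample, StandardCharsets.UTF_16BE));
    }
}
```

Such a check cannot repair anything; it only lets a program notice that the flavor it just read is unusable.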

Other browser used

When I do the same using the Chromium browser as the source of the data, the problems become smaller.

So I open the URL https://en.wikipedia.org/wiki/Germanic_umlaut in Chromium, select "letters: ä" there, and copy it into the clipboard. Then I run my Java program.

The result looks like:

axel@arichter:~/Dokumente/JAVA/poi/poi-3.17$ xxd "./Result 1 UTF-16.txt" 
00000000: feff 003c 006d 0065 0074 0061 0020 0068  ...<.m.e.t.a. .h
...
00000800: 0061 006c 003b 0022 003e 00e4 003c 002f  .a.l.;.".>...<./
00000810: 0069 003e 0000                           .i.>..

Chromium puts more HTML around the selection in the clipboard's HTML data flavors. But the encoding looks correct, also for the non-ASCII ä = 00E4. There is still a small problem: there are additional 0000 bytes at the end which should not be there. In UTF-16 there are 2 additional 00 bytes at the end.

axel@arichter:~/Dokumente/JAVA/poi/poi-3.17$ xxd "./Result 4 UTF-8.txt" 
00000000: 3c6d 6574 6120 6874 7470 2d65 7175 6976  <meta http-equiv
...
000003f0: 696f 6e2d 636f 6c6f 723a 2069 6e69 7469  ion-color: initi
00000400: 616c 3b22 3ec3 a43c 2f69 3e00            al;">..</i>.

Same as above. The encoding looks correct for UTF-8. But here too there is one additional 00 byte at the end which should not be there.
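For the Chromium case the trailing NUL terminator can simply be cut off before decoding. The following is a minimal sketch with hypothetical names; it assumes the terminator is one code unit wide (2 bytes for the UTF-16 variants, 1 byte otherwise) and that the HTML itself never ends in a NUL character.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class TrailingNulStripper {

    // Width in bytes of one NUL code unit (assumption: only UTF-8 and
    // UTF-16 variants are of interest here).
    static int unitSize(Charset charset) {
        return charset.name().startsWith("UTF-16") ? 2 : 1;
    }

    // Returns a copy of data without trailing all-zero code units.
    static byte[] stripTrailingNuls(byte[] data, Charset charset) {
        int unit = unitSize(charset);
        int end = data.length;
        while (end >= unit && isZero(data, end - unit, unit)) {
            end -= unit;
        }
        return Arrays.copyOf(data, end);
    }

    private static boolean isZero(byte[] data, int offset, int length) {
        for (int i = offset; i < offset + length; i++) {
            if (data[i] != 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Tail of "Result 4 UTF-8.txt" from the Chromium dump: "</i>" plus one 00 byte.
        byte[] utf8Tail = {0x3C, 0x2F, 0x69, 0x3E, 0x00};
        byte[] cleaned = stripTrailingNuls(utf8Tail, StandardCharsets.UTF_8);
        System.out.println(new String(cleaned, StandardCharsets.UTF_8));
    }
}
```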

Environment

DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.4 LTS"


Mozilla Firefox 61.0.1 (64-Bit)


java version "1.8.0_101"
Java(TM) SE Runtime Environment (build 1.8.0_101-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.101-b13, mixed mode)

Questions

Am I doing something wrong in my code?

Can someone advise how to avoid this garbled clipboard content? Since the non-ASCII characters are lost, at least when copied from Firefox, I don't think we can repair the content.

Is this a known issue somehow? Can someone confirm the same behavior? If so, is there already a bug report in Firefox about this?

Or is this a problem which only occurs if Java code reads the clipboard content? It seems so, because if I copy content from Firefox and paste it into LibreOffice Writer, Unicode appears properly. And if I then copy content from Writer to the clipboard and read it using my Java program, the UTF encodings are correct except for the additional 00 bytes at the end. So clipboard content copied from Writer behaves like content copied from Chromium.


New insights

The bytes 0xFFFD seem to be the Unicode character 'REPLACEMENT CHARACTER' (U+FFFD). So 0xFDFF is its little-endian representation and 0xEFBFBD is its UTF-8 encoding. So all results seem to come from wrongly decoding and then re-encoding Unicode.

It seems as if the clipboard content coming from Firefox is always UTF-16LE with BOM, but Java takes it as UTF-8. So each of the 2 BOM bytes becomes an invalid character which is replaced with 0xEFBFBD, each additional 0x00 byte becomes its own NUL character, and all byte sequences which are not proper UTF-8 become invalid characters which are replaced with 0xEFBFBD. Then this pseudo UTF-8 is re-encoded. Now the garbage is complete.

Example:

The sequence aɛaüa in UTF-16LE with BOM will be 0xFFFE 6100 5B02 6100 FC00 6100.

This taken as UTF-8 (0xEFBFBD = not a proper UTF-8 byte sequence) = 0xEFBFBD 0xEFBFBD a NUL [ STX a NUL 0xEFBFBD NUL a NUL.

This pseudo UTF-8 re-encoded to UTF-16LE will be: 0xFDFF FDFF 6100 0000 5B00 0200 6100 0000 FDFF 0000 6100 0000

This pseudo UTF-8 re-encoded to UTF-8 will be: 0xEFBF BDEF BFBD 6100 5B02 6100 EFBF BD00 6100

And this is exactly what happens.
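The example can be replayed in plain Java without any clipboard involved. This is a minimal sketch with hypothetical class and method names; it relies on new String(bytes, UTF_8) replacing malformed input with U+FFFD, which is the documented behavior of that constructor.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MangleDemo {

    // Decodes the bytes as UTF-8 (malformed input becomes U+FFFD)
    // and re-encodes the resulting string in the target charset.
    static byte[] misdecodeAndReencode(byte[] utf16leWithBom, Charset target) {
        String pseudo = new String(utf16leWithBom, StandardCharsets.UTF_8);
        return pseudo.getBytes(target);
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "aɛaüa" in UTF-16LE with BOM, as in the example above.
        byte[] original = {
            (byte) 0xFF, (byte) 0xFE, 0x61, 0x00, 0x5B, 0x02,
            0x61, 0x00, (byte) 0xFC, 0x00, 0x61, 0x00
        };
        // prints FDFFFDFF610000005B00020061000000FDFF000061000000
        System.out.println(hex(misdecodeAndReencode(original, StandardCharsets.UTF_16LE)));
        // prints EFBFBDEFBFBD61005B026100EFBFBD006100
        System.out.println(hex(misdecodeAndReencode(original, StandardCharsets.UTF_8)));
    }
}
```

Both printed values match the hand-computed byte sequences, which supports the assumption that a UTF-8 decode followed by a re-encode happens somewhere between the X11 selection and the Java data flavors.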

Other examples:

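Â (U+00C2) = 0x00C2 = C200 in UTF-16LE = 0xEFBFBD00 in pseudo UTF-8, because C2 followed by 00 is not a proper UTF-8 byte sequence

U+80C2 = 0x80C2 = C280 in UTF-16LE = 0xC280 in pseudo UTF-8, because C2 80 happens to be a proper UTF-8 byte sequence (it decodes to U+0080)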

So I think Firefox is not to blame for this, but rather Ubuntu or the Java runtime environment. And because copy/paste from Firefox to Writer works in Ubuntu, I think the Java runtime environment does not handle the Firefox data flavors in the Ubuntu clipboard correctly.


New insights:

I have compared the flavormap.properties files of my Windows 10 and my Ubuntu systems, and there is a difference. In Ubuntu the native name for text/html is UTF8_STRING, while in Windows it is HTML Format. I thought that this might be the problem, so I added the line

HTML\ Format=text/html;charset=utf-8;eoln="\n";terminators=0

to my flavormap.properties file in Ubuntu.

After that:

Map<DataFlavor,String> nativesForFlavors = SystemFlavorMap.getDefaultFlavorMap().getNativesForFlavors(
   new DataFlavor[]{
   new DataFlavor("text/html;charset=UTF-16LE")
   });

System.out.println(nativesForFlavors);

prints

{java.awt.datatransfer.DataFlavor[mimetype=text/html;representationclass=java.io.InputStream;charset=UTF-16LE]=HTML Format}

But there is no change in the Ubuntu clipboard content as read by Java.

Axel Richter, asked Jul 21 '18 09:07




1 Answer

After looking into this quite a bit, it appears to be a longstanding bug in Java (there is an even older report as well).

It looks like the X11 Java components expect clipboard data to always be UTF-8 encoded, while Firefox encodes the data as UTF-16. Because of this assumption, Java mangles the text by parsing UTF-16 as UTF-8. I tried but couldn't find a good way to bypass the issue. The "text" part of "text/html" seems to indicate to Java that the bytes received from the clipboard should always be interpreted as text first and then offered in the various flavors. I couldn't find any straightforward way to access the unconverted byte array from X11.
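One possible way around the AWT conversion, offered only as a sketch and not something verified in the question's environment, is to let an external tool fetch the raw selection bytes. The code below assumes that the xclip command-line tool is installed and that Firefox marks its text/html data with a UTF-16 BOM, as inferred in the question; the class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class XclipHtmlReader {

    // Assumption: xclip is on the PATH; "-t text/html -o" prints that target's raw bytes.
    static byte[] readClipboardHtml() throws IOException, InterruptedException {
        Process process = new ProcessBuilder(
                "xclip", "-selection", "clipboard", "-t", "text/html", "-o")
                .redirectError(ProcessBuilder.Redirect.INHERIT)
                .start();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream in = process.getInputStream()) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }
        if (process.waitFor() != 0) {
            throw new IOException("xclip failed; is text/html on the clipboard?");
        }
        return out.toByteArray();
    }

    // Picks the charset from a leading BOM; falls back to UTF-8 (assumption).
    static String decode(byte[] raw) {
        if (raw.length >= 2 && (raw[0] & 0xFF) == 0xFF && (raw[1] & 0xFF) == 0xFE) {
            return new String(raw, 2, raw.length - 2, StandardCharsets.UTF_16LE);
        }
        if (raw.length >= 2 && (raw[0] & 0xFF) == 0xFE && (raw[1] & 0xFF) == 0xFF) {
            return new String(raw, 2, raw.length - 2, StandardCharsets.UTF_16BE);
        }
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws InterruptedException {
        try {
            System.out.println(decode(readClipboardHtml()));
        } catch (IOException e) {
            System.err.println("Could not read clipboard via xclip: " + e.getMessage());
        }
    }
}
```

This trades portability for correctness: it only works on X11 systems with xclip available, so it would be a fallback next to the regular java.awt.datatransfer code rather than a replacement for it.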

Michael Powers, answered Oct 29 '22 22:10