base64 decode string - emacs different than jvm?

Question

With the base64-encoded string JVBERi0xLjENCiXi48/TDQoxIDAgb2JqDQo8PCAN I am getting different results from emacs than from the clojure code below.

Can anyone explain to me why?

The elisp below gives the correct output, ultimately giving me a valid PDF document (when I paste the entire string). I am sure my emacs buffer is set to utf-8:

(base64-decode-string "JVBERi0xLjENCiXi48/TDQoxIDAgb2JqDQo8PCAN")

"%PDF-1.1
 %âãÏÓ
 1 0 obj
 <<

Here is the same output with the non-ASCII chars shown as octal escapes (I think):

  "%PDF-1.1
  %\342\343\317\323
  1

The clojure below gives incorrect output, rendering the PDF document invalid when I give it the entire string:

(import 'java.util.Base64)

(defn decode [to-decode]
  (let [byts    (.getBytes to-decode "UTF-8")          ; base64 text -> bytes
        decoded (.decode (Base64/getDecoder) byts)]    ; raw decoded bytes
    ;; interprets the decoded bytes as UTF-8 text
    (String. decoded "UTF-8")))


(decode "JVBERi0xLjENCiXi48/TDQoxIDAgb2JqDQo8PCAN")

"%PDF-1.1
%����
1 0 obj
<<

Same output, with the non-ASCII chars as octal escapes (I think). I couldn't even copy/paste this; I had to type it in. This is what the first three lines look like when I open the PDF in text-mode:

 "%PDF-1.1
  %\357\277\275\357\277\275\357\277\275\357\277\275
  1"

Edit Taking emacs out of the equation:

If I write the encoded string to a file called encoded.txt and pipe it through the Linux program base64 --decode, I get valid output and a good PDF. This is the clojure:

(defn decode  [to-decode]
  (let [byts        (.getBytes to-decode "ASCII")
        decoded     (.decode (java.util.Base64/getDecoder) byts)
        flip-negatives  #(if (neg? %) (char (+ 255 %)) (char %))
        ]
    (String. (char-array (map flip-negatives decoded)) )))

(spit "./output/decoded.pdf" (decode "JVBERi0xLjENCiXi48/TDQoxIDAgb2JqDQo8PCAN"))

(spit "./output/encoded.txt" "JVBERi0xLjENCiXi48/TDQoxIDAgb2JqDQo8PCAN")

Then this at the shell:

➜  output git:(master) ✗ cat encoded.txt| base64 --decode > decoded2.pdf 
➜  output git:(master) ✗ diff decoded.pdf decoded2.pdf 
2c2
< %áâÎÒ
---
> %����
➜  output git:(master) ✗
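
A note on the diff above: the first file shows áâÎÒ where emacs showed âãÏÓ, so every high byte is exactly one too low. That is consistent with flip-negatives adding 255 where the signed-to-unsigned offset is 256. A minimal sketch of the arithmetic follows; the names unsigned and raw are made up for illustration and are not from the code above.

```clojure
;; JVM bytes are signed (-128..127). Masking with 0xFF recovers the
;; unsigned value 0..255. `unsigned` and `raw` are hypothetical names.
(defn unsigned [b] (bit-and b 0xFF))

;; The four bytes after the second "%" in the decoded PDF header.
(def raw (map unchecked-byte [0xE2 0xE3 0xCF 0xD3]))  ; (-30 -29 -49 -45)

(mapv unsigned raw)     ; => [226 227 207 211], the code points of â ã Ï Ó
(mapv #(+ 255 %) raw)   ; => [225 226 206 210], the code points of á â Î Ò
```

Even with the offset fixed, spit encodes the string as UTF-8 unless told otherwise, so each char above 127 would still be written as two bytes rather than one.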

update - this seems to work

Alan Thompson's answer below put me on the correct track, but geez what a pain to get there. Here's the idea of what works:

(def iso-latin-1-charset (java.nio.charset.Charset/forName "ISO-8859-1" ))

(as-> some-giant-string-i-hate-at-this-point $
  (.getBytes $)
  (String. $   iso-latin-1-charset)
  (base64/decode $ "ISO-8859-1")
  (spit "./output/a-pdf-that-actually-works.pdf" $ :encoding "ISO-8859-1" ))
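
Another way to get the same result, sketched here under the assumption that the goal is only to get the PDF onto disk, is to never turn the decoded bytes into a String at all. The function name decode-to-file! is made up for this example; it uses only java.util.Base64 and clojure.java.io.

```clojure
(require '[clojure.java.io :as io])
(import 'java.util.Base64)

(defn decode-to-file!
  "Decodes base64 text and writes the resulting bytes to path unchanged.
  Hypothetical helper: no charset is involved after decoding."
  [^String b64 path]
  (let [^bytes decoded (.decode (Base64/getDecoder) b64)]
    (with-open [out (io/output-stream path)]
      (.write out decoded))
    decoded))

;; Usage, writing to a temp file rather than ./output:
(def tmp (java.io.File/createTempFile "decoded" ".pdf"))
(decode-to-file! "JVBERi0xLjENCiXi48/TDQoxIDAgb2JqDQo8PCAN" tmp)
```

Because the bytes go straight from the decoder to the output stream, there is no encoding to get wrong on the way out.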

Alan Thompson · Accepted Answer

Returning the results as a string, I get:

(b64/decode-str "JVBERi0xLjENCiXi48/TDQoxIDAgb2JqDQo8PCAN")  
  => "%PDF-1.1
%����
1 0 obj
<< 
"

and as a vector of ints:

(mapv int (b64/decode-str "JVBERi0xLjENCiXi48/TDQoxIDAgb2JqDQo8PCAN")) 

  => [37 80 68 70 45 49 46 49 13 10 37 65533 65533 65533 65533 13 10 49 32 48 
      32 111 98 106 13 10 60 60 32 13]

Since both the beginning and end of the string look OK, I suspect the B64 string might be malformed?

Update

I went to http://www.base64decode.org and got the result

"Malformed input... :("

Update #2

The root of the problem is that the decoded bytes are not UTF-8 encoded text. Rather, they read correctly as ISO-8859-1 (aka ISO-LATIN-1). See this code:

  (defn decode-bytes
    "Decodes a byte array from base64, returning a new byte array."
    [code-bytes]
    (.decode (java.util.Base64/getDecoder) code-bytes))

  (def iso-latin-1-charset (java.nio.charset.Charset/forName "ISO-8859-1" )) ; aka ISO-LATIN-1

  (let [b64-str         "JVBERi0xLjENCiXi48/TDQoxIDAgb2JqDQo8PCAN"
        bytes-default   (vec (.getBytes b64-str))
        bytes-8859      (vec (.getBytes b64-str iso-latin-1-charset))

        src-byte-array  (decode-bytes (byte-array bytes-default))
        src-bytes       (vec src-byte-array)
        src-str-8859    (String. src-byte-array iso-latin-1-charset)
        ]...  ))

with result:

iso-latin-1-charset => <#sun.nio.cs.ISO_8859_1 #object[sun.nio.cs.ISO_8859_1 0x3edbd6e8 "ISO-8859-1"]>

bytes-default  => [74 86 66 69 82 105 48 120 76 106 69 78 67 105 88 105 52 56 47 84 68 81 111 120 73 68 65 103 98 50 74 113 68 81 111 56 80 67 65 78]
bytes-8859     => [74 86 66 69 82 105 48 120 76 106 69 78 67 105 88 105 52 56 47 84 68 81 111 120 73 68 65 103 98 50 74 113 68 81 111 56 80 67 65 78]

(= bytes-default bytes-8859) => true

src-bytes      => [37 80 68 70 45 49 46 49 13 10 37 -30 -29 -49 -45 13 10 49 32 48 32 111 98 106 13 10 60 60 32 13]
src-str-8859   => "%PDF-1.1
%âãÏÓ
1 0 obj
<< 
"

So the java.lang.String constructor will work correctly with a byte[] input, even when the high bit is set (making them look like "negative" values), as long as you tell the constructor the correct java.nio.charset.Charset to use for interpreting the values.
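
To make the difference concrete, here is a small sketch that decodes the same bytes with both charsets and then converts each string back to bytes. The var names decoded, as-utf8 and as-latin1 are made up for this example.

```clojure
(import 'java.util.Base64
        'java.nio.charset.StandardCharsets)

(def decoded (.decode (Base64/getDecoder) "JVBERi0xLjENCiXi48/TDQoxIDAgb2JqDQo8PCAN"))

(def as-utf8   (String. ^bytes decoded StandardCharsets/UTF_8))
(def as-latin1 (String. ^bytes decoded StandardCharsets/ISO_8859_1))

;; Chars 11..14 are the four high bytes after the second "%".
(mapv int (subs as-utf8 11 15))     ; => [65533 65533 65533 65533]
(mapv int (subs as-latin1 11 15))   ; => [226 227 207 211]

;; ISO-8859-1 maps every byte to exactly one char, so it round-trips.
(= (vec decoded) (vec (.getBytes as-latin1 StandardCharsets/ISO_8859_1)))  ; => true
;; UTF-8 replaced the invalid bytes with U+FFFD, so the originals are gone.
(= (vec decoded) (vec (.getBytes as-utf8 StandardCharsets/UTF_8)))         ; => false
```

The 65533 values are the same replacement characters seen in the vector of ints earlier in this answer.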

Interesting that the object type is sun.nio.cs.ISO_8859_1.

Update #3

See the SO question below for a list of libraries that can (usually) autodetect the encoding of a byte stream (e.g. UTF-8, ISO-8859-1, ...)

What is the most accurate encoding detector?

Tags:

base64

clojure

joefromct
