What encoding does Facebook use in the JSON files from its data export?

Tags: json, facebook

I've used the Facebook feature to download all my data. The resulting zip file contains meta information in JSON files. The problem is that Unicode characters in the strings in these JSON files are escaped in a weird way.

Here's an example of such a string:

"nejni\u00c5\u00be\u00c5\u00a1\u00c3\u00ad bod: 0 mnm Ben\u00c3\u00a1tky\n"

When I parse the string, for example with JavaScript's JSON.parse(), and print it out, I get:

"nejnižší bod: 0 mnm Benátky\n"

While it should be

"nejnižší bod: 0 mnm Benátky\n"

I can see that \u00c5\u00be should somehow correspond to ž, but I can't figure out the general pattern.

I've been able to figure out these characters so far:

'\u00c2\u00b0' : '°',
'\u00c3\u0081' : 'Á',
'\u00c3\u00a1' : 'á',
'\u00c3\u0089' : 'É',
'\u00c3\u00a9' : 'é',
'\u00c3\u00ad' : 'í',
'\u00c3\u00ba' : 'ú',
'\u00c3\u00bd' : 'ý',
'\u00c4\u008c' : 'Č',
'\u00c4\u008d' : 'č',
'\u00c4\u008f' : 'ď',
'\u00c4\u009b' : 'ě',
'\u00c5\u0098' : 'Ř',
'\u00c5\u0099' : 'ř',
'\u00c5\u00a0' : 'Š',
'\u00c5\u00a1' : 'š',
'\u00c5\u00af' : 'ů',
'\u00c5\u00be' : 'ž',

So what is this weird encoding? Is there any known tool that can correctly decode it?

Jen asked Oct 10 '18


4 Answers

Thanks to Jen's excellent question and Shawn's comment.

Basically, Facebook seems to take each individual byte of the string's UTF-8 representation and export it to JSON as if each byte were its own Unicode code point.

What we need to do is take the last two characters of each sextet (e.g. c3 from \u00c3), treat the concatenated result as a sequence of bytes, and read those bytes as a UTF-8 string.

This is how I do it in Ruby (see gist):

require 'json'
require 'uri'

# Match runs of \uXXXX escapes, capturing what precedes them: either an
# even run of backslashes or a single non-backslash character.
bytes_re = /((?:\\\\)+|[^\\])(?:\\u[0-9a-f]{4})+/

txt = File.read('export.json').gsub(bytes_re) do |bad_unicode|
  # Rewrite each \u00XX as \xXX, eval to turn the escapes into real bytes,
  # then re-escape the decoded string for JSON.
  $1 + eval(%Q{"#{bad_unicode[$1.size..-1].gsub('\u00', '\x')}"}).to_json[1...-1]
end

good_data = JSON.load(txt)

With bytes_re we catch all sequences of bad Unicode characters.

Then, for each such sequence, we replace '\u00' with '\x' (e.g. \xc3), wrap the result in double quotes, and use Ruby's built-in string parsing (via eval) so that the \xc3\xbe... escapes are converted to actual bytes. The decoded characters then either remain as-is in the JSON or are properly escaped by the #to_json method.

The [1...-1] strips the surrounding quotes inserted by #to_json.

I wanted to explain the code because the question is not Ruby-specific and readers may be using another language.

I guess somebody can do it with a sufficiently ugly sed command.
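
For readers outside Ruby, here is a rough Python equivalent of the same idea (a sketch, not the author's code; it assumes, like the export itself, that every escape has the single-byte form \u00XX):

import json
import re

# Runs of \u00XX escapes, with whatever precedes them (an even run of
# backslashes, or one non-backslash character) captured separately.
bytes_re = re.compile(r'((?:\\\\)+|[^\\])((?:\\u00[0-9a-f]{2})+)')

def fix_escapes(m):
    # Turn the escapes into real bytes, decode them as UTF-8, then
    # re-escape the decoded text for JSON (json.dumps adds quotes; strip them).
    raw = bytes(int(h, 16) for h in re.findall(r'\\u00([0-9a-f]{2})', m.group(2)))
    return m.group(1) + json.dumps(raw.decode('utf-8'))[1:-1]

with open('export.json', encoding='utf-8') as f:
    good_data = json.loads(bytes_re.sub(fix_escapes, f.read()))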

akostadinov answered Oct 04 '22

The escaped bytes form valid UTF-8. The problem is that JavaScript strings don't use UTF-8; they use UTF-16. So you have to convert from the valid UTF-8 to JavaScript's UTF-16:

function decode(s) {
   // Every character in the broken string is really one byte (code point <= 0xFF),
   // so collect the code units as bytes and decode them as UTF-8.
   let d = new TextDecoder();  // defaults to 'utf-8'
   let a = s.split('').map(r => r.charCodeAt(0));
   return d.decode(new Uint8Array(a));
}

let s = "nejni\u00c5\u00be\u00c5\u00a1\u00c3\u00ad bod: 0 mnm Ben\u00c3\u00a1tky\n";
s = decode(s);
console.log(s);

https://developer.mozilla.org/docs/Web/API/TextDecoder
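
The same byte round-trip is a one-liner in Python (a minimal sketch): encoding the broken string as Latin-1 recovers the original bytes, which then decode cleanly as UTF-8.

s = "nejni\u00c5\u00be\u00c5\u00a1\u00c3\u00ad bod: 0 mnm Ben\u00c3\u00a1tky\n"
print(s.encode('latin-1').decode('utf-8'))  # nejnižší bod: 0 mnm Benátky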

Zombo answered Oct 10 '22

You can use a regular expression to find runs of these mis-decoded characters, encode them back to Latin-1 (recovering the original bytes), and then decode those bytes as UTF-8.

The following code should work in Python 3.x:

import re

s = "nejni\u00c5\u00be\u00c5\u00a1\u00c3\u00ad bod: 0 mnm Ben\u00c3\u00a1tky\n"
fixed = re.sub(r'[\xc2-\xf4][\x80-\xbf]+',
               lambda m: m.group(0).encode('latin1').decode('utf8'), s)
print(fixed)  # nejnižší bod: 0 mnm Benátky

Varun Mathur answered Oct 10 '22


The JSON file itself is UTF-8, but the strings in it are characters converted to UTF-8 byte sequences, with each byte then written out as a \u00XX escape sequence.

This command fixes a file like this in Emacs:

;; Needs Emacs 27+ for json-parse-string/json-serialize, and dash.el for `-->'.
(defun k/format-facebook-backup ()
  "Normalize a Facebook backup JSON file."
  (interactive)
  (save-excursion
    (goto-char (point-min))
    (let ((inhibit-read-only t)
          (size (point-max))
          bounds str)
      (while (search-forward "\"\\u" nil t)
        (message "%.f%%" (* 100 (/ (point) size 1.0)))
        (setq bounds (bounds-of-thing-at-point 'string))
        (when bounds
          (setq str (--> (json-parse-string (buffer-substring (car bounds)
                                                              (cdr bounds)))
                         (string-to-list it)
                         (apply #'unibyte-string it)
                         (decode-coding-string it 'utf-8)))
          (setf (buffer-substring (car bounds) (cdr bounds))
                (json-serialize str))))))
  (save-buffer))
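
A rough Python port of the same post-parse approach (a sketch; export.json is an assumed file name): parse the JSON first, then walk the result and repair every string with a Latin-1 round-trip.

import json

def fix(value):
    # Strings in the export hold UTF-8 bytes masquerading as characters:
    # a Latin-1 encode recovers the bytes, which then decode as UTF-8.
    if isinstance(value, str):
        try:
            return value.encode('latin-1').decode('utf-8')
        except (UnicodeEncodeError, UnicodeDecodeError):
            return value  # already plain text, leave it alone
    if isinstance(value, list):
        return [fix(v) for v in value]
    if isinstance(value, dict):
        return {fix(k): fix(v) for k, v in value.items()}
    return value

with open('export.json', encoding='utf-8') as f:
    data = fix(json.load(f))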

Kisaragi Hiu answered Oct 10 '22