What encoding does Facebook use in the JSON files from its data export?

Tags: json, facebook

I've used the Facebook feature to download all my data. The resulting zip file contains meta information in JSON files. The problem is that Unicode characters in the strings in these JSON files are escaped in a weird way.

Here's an example of such a string:

"nejni\u00c5\u00be\u00c5\u00a1\u00c3\u00ad bod: 0 mnm Ben\u00c3\u00a1tky\n"

When I parse the string, for example with JavaScript's JSON.parse(), and print it out, I get:

"nejnižší bod: 0 mnm Benátky\n"

While it should be

"nejnižší bod: 0 mnm Benátky\n"

I can see that \u00c5\u00be should somehow correspond to ž, but I can't figure out the general pattern.

I've been able to figure out these characters so far:

'\u00c2\u00b0' : '°',
'\u00c3\u0081' : 'Á',
'\u00c3\u00a1' : 'á',
'\u00c3\u0089' : 'É',
'\u00c3\u00a9' : 'é',
'\u00c3\u00ad' : 'í',
'\u00c3\u00ba' : 'ú',
'\u00c3\u00bd' : 'ý',
'\u00c4\u008c' : 'Č',
'\u00c4\u008d' : 'č',
'\u00c4\u008f' : 'ď',
'\u00c4\u009b' : 'ě',
'\u00c5\u0098' : 'Ř',
'\u00c5\u0099' : 'ř',
'\u00c5\u00a0' : 'Š',
'\u00c5\u00a1' : 'š',
'\u00c5\u00af' : 'ů',
'\u00c5\u00be' : 'ž',

So what is this weird encoding? Is there any known tool that can correctly decode it?

Jen asked Oct 10 '18


4 Answers

Thanks to Jen's excellent question and Shawn's comment.

Basically, Facebook seems to take each individual byte of the string's UTF-8 representation and export it to JSON as if each byte were its own Unicode code point.

What we need to do is take the last two characters of each sextet (e.g. c3 from \u00c3), treat the concatenated result as a sequence of bytes, and read those bytes as a UTF-8 string.

This is how I do it in Ruby (see gist):

require 'json'
require 'uri'

# Match runs of \uXXXX escapes, capturing what precedes them: either an
# even run of backslashes or a single non-backslash character.
bytes_re = /((?:\\\\)+|[^\\])(?:\\u[0-9a-f]{4})+/

txt = File.read('export.json').gsub(bytes_re) do |bad_unicode|
  # Rewrite each \u00XX as \xXX, eval to turn the escapes into real bytes,
  # then re-escape the decoded string for JSON.
  $1 + eval(%Q{"#{bad_unicode[$1.size..-1].gsub('\u00', '\x')}"}).to_json[1...-1]
end

good_data = JSON.load(txt)

With bytes_re we catch all sequences of bad Unicode characters.

Then, for each such sequence, we replace '\u00' with '\x' (e.g. \xc3), wrap the result in double quotes, and use Ruby's built-in string parsing (via eval) so that the \xc3\xbe... escapes are converted to actual bytes. The decoded characters then either remain as-is in the JSON or are properly escaped by the #to_json method.

The [1...-1] strips the surrounding quotes inserted by #to_json.

I wanted to explain the code because the question is not Ruby-specific and readers may be using another language.

I guess somebody can do it with a sufficiently ugly sed command.
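
For readers outside Ruby, here is a rough Python equivalent of the same idea (a sketch, not the author's code; it assumes, like the export itself, that every escape has the single-byte form \u00XX):

import json
import re

# Runs of \u00XX escapes, with whatever precedes them (an even run of
# backslashes, or one non-backslash character) captured separately.
bytes_re = re.compile(r'((?:\\\\)+|[^\\])((?:\\u00[0-9a-f]{2})+)')

def fix_escapes(m):
    # Turn the escapes into real bytes, decode them as UTF-8, then
    # re-escape the decoded text for JSON (json.dumps adds quotes; strip them).
    raw = bytes(int(h, 16) for h in re.findall(r'\\u00([0-9a-f]{2})', m.group(2)))
    return m.group(1) + json.dumps(raw.decode('utf-8'))[1:-1]

with open('export.json', encoding='utf-8') as f:
    good_data = json.loads(bytes_re.sub(fix_escapes, f.read()))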

akostadinov answered Oct 04 '22

The escaped bytes form valid UTF-8. The problem is that JavaScript strings don't use UTF-8; they use UTF-16. So you have to convert from the valid UTF-8 to JavaScript's UTF-16:

function decode(s) {
   // Every character in the broken string is really one byte (code point <= 0xFF),
   // so collect the code units as bytes and decode them as UTF-8.
   let d = new TextDecoder();  // defaults to 'utf-8'
   let a = s.split('').map(r => r.charCodeAt(0));
   return d.decode(new Uint8Array(a));
}

let s = "nejni\u00c5\u00be\u00c5\u00a1\u00c3\u00ad bod: 0 mnm Ben\u00c3\u00a1tky\n";
s = decode(s);
console.log(s);

https://developer.mozilla.org/docs/Web/API/TextDecoder
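
The same byte round-trip is a one-liner in Python (a minimal sketch): encoding the broken string as Latin-1 recovers the original bytes, which then decode cleanly as UTF-8.

s = "nejni\u00c5\u00be\u00c5\u00a1\u00c3\u00ad bod: 0 mnm Ben\u00c3\u00a1tky\n"
print(s.encode('latin-1').decode('utf-8'))  # nejnižší bod: 0 mnm Benátky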

Zombo answered Oct 10 '22

You can use a regular expression to find runs of these mis-decoded characters, encode them back to Latin-1 (recovering the original bytes), and then decode those bytes as UTF-8.

The following code should work in Python 3.x:

import re

s = "nejni\u00c5\u00be\u00c5\u00a1\u00c3\u00ad bod: 0 mnm Ben\u00c3\u00a1tky\n"
fixed = re.sub(r'[\xc2-\xf4][\x80-\xbf]+',
               lambda m: m.group(0).encode('latin1').decode('utf8'), s)
print(fixed)  # nejnižší bod: 0 mnm Benátky

Varun Mathur answered Oct 10 '22


The JSON file itself is UTF-8, but the strings in it are characters converted to UTF-8 byte sequences, with each byte then written out as a \u00XX escape sequence.

This command fixes a file like this in Emacs:

;; Needs Emacs 27+ for json-parse-string/json-serialize, and dash.el for `-->'.
(defun k/format-facebook-backup ()
  "Normalize a Facebook backup JSON file."
  (interactive)
  (save-excursion
    (goto-char (point-min))
    (let ((inhibit-read-only t)
          (size (point-max))
          bounds str)
      (while (search-forward "\"\\u" nil t)
        (message "%.f%%" (* 100 (/ (point) size 1.0)))
        (setq bounds (bounds-of-thing-at-point 'string))
        (when bounds
          (setq str (--> (json-parse-string (buffer-substring (car bounds)
                                                              (cdr bounds)))
                         (string-to-list it)
                         (apply #'unibyte-string it)
                         (decode-coding-string it 'utf-8)))
          (setf (buffer-substring (car bounds) (cdr bounds))
                (json-serialize str))))))
  (save-buffer))
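
A rough Python port of the same post-parse approach (a sketch; export.json is an assumed file name): parse the JSON first, then walk the result and repair every string with a Latin-1 round-trip.

import json

def fix(value):
    # Strings in the export hold UTF-8 bytes masquerading as characters:
    # a Latin-1 encode recovers the bytes, which then decode as UTF-8.
    if isinstance(value, str):
        try:
            return value.encode('latin-1').decode('utf-8')
        except (UnicodeEncodeError, UnicodeDecodeError):
            return value  # already plain text, leave it alone
    if isinstance(value, list):
        return [fix(v) for v in value]
    if isinstance(value, dict):
        return {fix(k): fix(v) for k, v in value.items()}
    return value

with open('export.json', encoding='utf-8') as f:
    data = fix(json.load(f))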

Kisaragi Hiu answered Oct 10 '22