
Bad JSON - Keys are not quoted

Tags: python, json, jsonp

I am scraping some JSONP dictionaries from AWS (from JavaScript files). After extracting just the JSON-like data from the raw response, in some cases I get valid JSON and can successfully load it in Python (json_data = json.loads(json_like_data)). However, some of Amazon's JSONP payloads do not include quotes around their keys (see the following).

...
{type:"storageCurrentGen",sizes:
[{size:"i2.xlarge",vCPU:"4",ECU:"14",memoryGiB:"30.5",storageGB:"1 x 800 SSD",valueColumns:[{name:"linux",prices:{USD:"0.938"}}]},
{size:"i2.2xlarge",vCPU:"8",ECU:"27",memoryGiB:"61",storageGB:"2 x 800 SSD",valueColumns:[{name:"linux",prices:{USD:"1.876"}}]},
{size:"i2.4xlarge",vCPU:"16",ECU:"53",memoryGiB:"122",storageGB:"4 x 800 SSD",valueColumns:[{name:"linux",prices:{USD:"3.751"}}]},
...

For JSONP, this still works because it is valid JavaScript syntax. However, Python's json.loads(json_str) fails because it is not valid JSON.
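A minimal reproduction of the failure, using a shortened, made-up sample in the same shape as the data above (the variable names here are only for illustration):

```python
import json

# Shortened sample in the same shape as the AWS data: keys are unquoted.
sample = '{size:"i2.xlarge",vCPU:"4"}'

try:
    json.loads(sample)
    error = None
except ValueError as exc:  # json.JSONDecodeError is a subclass of ValueError
    error = exc

print(type(error).__name__, error)

# The same data with quoted keys parses without complaint.
quoted = json.loads('{"size":"i2.xlarge","vCPU":"4"}')
print(quoted["vCPU"])
```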

There is another Python module, PyYAML (yaml), which can handle unquoted keys, BUT there must be a space after the colons (:).

I figure that I have two options.

  1. Somehow quote the characters between an open brace or comma ({ | ,) and a colon (:). Then use json.loads(...).
  2. Add a space after every colon (:). Then parse with yaml.load(...).

My guess is that option 2 is better than option 1. However, I am looking for suggestions for a better solution.

Has anyone encountered malformed JSON like this before and used Python to parse it?

dlstadther asked Jan 15 '16 14:01



2 Answers

What you have is effectively an HJSON document, so you can use the hjson project to parse it:

>>> import hjson
>>> hjson.loads('{javascript_style:"Look ma, no quotes!"}')
OrderedDict([('javascript_style', 'Look ma, no quotes!')])

HJSON is JSON without the requirement to quote object names and even for certain string values, with added comment support and multi-line strings, and with relaxed rules on where commas should be used (including not using commas at all).

Or you could install and use the demjson library; it can parse JavaScript-style object literals, including keys that are missing quotes:

import demjson

result = demjson.decode(jsonp_payload)

Only when you set the strict=True flag does demjson refuse to parse your input:

>>> import demjson
>>> demjson.decode('{javascript_style:"Look ma, no quotes!"}')
{u'javascript_style': u'Look ma, no quotes!'}
>>> demjson.decode('{javascript_style:"Look ma, no quotes!"}', strict=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mjpieters/Development/venvs/stackoverflow-2.7/lib/python2.7/site-packages/demjson.py", line 5701, in decode
    return_stats=(return_stats or write_stats) )
  File "/Users/mjpieters/Development/venvs/stackoverflow-2.7/lib/python2.7/site-packages/demjson.py", line 4917, in decode
    raise errors[0]
demjson.JSONDecodeError: ('JSON does not allow identifiers to be used as strings', u'javascript_style')

Alternatively, you can try to regex your way to valid JSON, although this can lead to false positives. The pattern would be:

import re

valid_json = re.sub(r'(?<={|,)([a-zA-Z][a-zA-Z0-9]*)(?=:)', r'"\1"', jsonp_payload)

This matches a { or ,, followed by a simple JavaScript identifier (a letter, followed by more letters or digits; identifiers containing _ or $ are not covered), and followed directly by a : colon. If your quoted values contain any such patterns, the substitution will alter them too and you'll get invalid JSON.
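The false-positive risk described above can be reduced by matching string literals in the same pass, so that anything inside quotes is copied through untouched. A minimal sketch, assuming the payload only uses double-quoted strings and contains no comments; the TOKEN pattern and the quote_keys helper are illustrative names, not part of any library:

```python
import json
import re

# Either a complete double-quoted string (group 1), or a bare identifier
# that is followed by a colon (group 2).
TOKEN = re.compile(r'("(?:\\.|[^"\\])*")|([A-Za-z_$][A-Za-z0-9_$]*)(?=\s*:)')


def quote_keys(text):
    """Quote bare identifiers used as keys, leaving string literals untouched."""
    def fix(match):
        if match.group(1) is not None:
            return match.group(1)       # string literal: copy through unchanged
        return '"%s"' % match.group(2)  # bare key: wrap in double quotes
    return TOKEN.sub(fix, text)


# Made-up input whose string value happens to look like ",key:".
tricky = '{name:"a,b:c",prices:{USD:"0.938"}}'

# The simple pattern also rewrites the "b" inside the string value.
naive = re.sub(r'(?<={|,)([a-zA-Z][a-zA-Z0-9]*)(?=:)', r'"\1"', tricky)

# The string-aware version only touches the real keys.
careful = quote_keys(tricky)
print(json.loads(careful))
```

This is still a heuristic rather than a real JavaScript parser: single-quoted strings, comments, and other non-JSON syntax would need extra handling, which is where a library such as hjson or demjson is the safer choice.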

Martijn Pieters answered Oct 23 '22 19:10



You can also do it (in this particular case) with a simple regex:

ll = '{type:"storageCurrentGen",sizes:\n[{size:"i2.xlarge",vCPU:"4",ECU:"14",memoryGiB:"30.5",storageGB:"1 x 800 SSD",valueColumns:[{name:"linux",prices:{USD:"0.938"}}]},\n{size:"i2.2xlarge",vCPU:"8",ECU:"27",memoryGiB:"61",storageGB:"2 x 800 SSD",valueColumns:[{name:"linux",prices:{USD:"1.876"}}]},\n{size:"i2.4xlarge",vCPU:"16",ECU:"53",memoryGiB:"122",storageGB:"4 x 800 SSD",valueColumns:[{name:"linux",prices:{USD:"3.751"}}]},'

import re

# Wrap any bare word that sits between {, "," or ":" and }, "," or ":" in double quotes.
ll_patched = re.sub(r'([{,:])(\w+)([},:])', r'\1"\2"\3', ll)
>>> ll_patched
'{"type":"storageCurrentGen","sizes":\n[{"size":"i2.xlarge","vCPU":"4","ECU":"14","memoryGiB":"30.5","storageGB":"1 x 800 SSD","valueColumns":[{"name":"linux","prices":{"USD":"0.938"}}]},\n{"size":"i2.2xlarge","vCPU":"8","ECU":"27","memoryGiB":"61","storageGB":"2 x 800 SSD","valueColumns":[{"name":"linux","prices":{"USD":"1.876"}}]},\n{"size":"i2.4xlarge","vCPU":"16","ECU":"53","memoryGiB":"122","storageGB":"4 x 800 SSD","valueColumns":[{"name":"linux","prices":{"USD":"3.751"}}]},'
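The snippet above is truncated, so the patched string cannot be passed to json.loads as it stands. As a rough end-to-end check, here is the same substitution applied to a complete but shortened sample (made up for illustration, in the same shape as the data in the question), followed by a lookup of the Linux prices:

```python
import json
import re

# Complete, shortened sample in the same shape as the data in the question.
raw = ('{type:"storageCurrentGen",sizes:['
       '{size:"i2.xlarge",vCPU:"4",valueColumns:[{name:"linux",prices:{USD:"0.938"}}]},'
       '{size:"i2.2xlarge",vCPU:"8",valueColumns:[{name:"linux",prices:{USD:"1.876"}}]}'
       ']}')

# Same substitution as above: quote bare words between delimiters.
patched = re.sub(r'([{,:])(\w+)([},:])', r'\1"\2"\3', raw)
data = json.loads(patched)

# Map each instance size to its Linux price in USD.
linux_usd = {
    entry["size"]: entry["valueColumns"][0]["prices"]["USD"]
    for entry in data["sizes"]
}
print(linux_usd)
```

The same caveat applies as for any regex approach: a string value that itself contains something like ,word: would be rewritten as well, so this is only safe when the values are known not to contain such sequences.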
Ashalynd answered Oct 23 '22 18:10
