 

Parse non-standard semicolon separated "JSON"

I have a non-standard "JSON" file to parse. Each item is semicolon separated instead of comma separated. I can't simply replace ; with , because some values may themselves contain ;, e.g. "hello; world". How can I parse this into the same structure that JSON would normally produce?

{
  "client" : "someone";
  "server" : ["s1"; "s2"];
  "timestamp" : 1000000;
  "content" : "hello; world";
  ...
}
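To illustrate why a blanket text replacement is unsafe here, a quick sketch using the "content" key from the sample above:

```python
import json

raw = '{ "content" : "hello; world" }'

# A blanket replace also rewrites the semicolon *inside* the string literal
naive = raw.replace(';', ',')
print(naive)             # { "content" : "hello, world" }
print(json.loads(naive)) # parses fine, but the data is silently corrupted
```

The replacement has to distinguish semicolons that act as separators from semicolons that are part of string values.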
morefree asked Dec 24 '22 02:12

1 Answer

Use the Python tokenize module to transform the text stream into one with commas instead of semicolons. The Python tokenizer is happy to handle JSON input, semicolons included. It presents strings as whole tokens, so any 'raw' semicolons appear in the stream as single tokenize.OP tokens for you to replace:

import tokenize
import json

corrected = []

with open('semi.json', 'r') as semi:
    for token in tokenize.generate_tokens(semi.readline):
        if token[0] == tokenize.OP and token[1] == ';':
            corrected.append(',')
        else:
            corrected.append(token[1])

data = json.loads(''.join(corrected))

This assumes that the format becomes valid JSON once you've replaced the semicolons with commas; e.g. no trailing commas before a closing ] or } are allowed. That said, you could track the last comma appended and remove it again if the next non-newline token is a closing bracket or brace.
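That extra bookkeeping could look something like this sketch (fix_semi_json is a hypothetical helper name; it reads from a string via io.StringIO rather than a file):

```python
import json
import tokenize
from io import StringIO

def fix_semi_json(text):
    """Convert semicolon-separated pseudo-JSON to real JSON, dropping
    any comma left dangling before a closing ] or }."""
    corrected = []
    for token in tokenize.generate_tokens(StringIO(text).readline):
        tok_type, tok_string = token[0], token[1]
        if tok_type == tokenize.OP and tok_string == ';':
            corrected.append(',')
        elif tok_type == tokenize.OP and tok_string in (']', '}'):
            # Walk back over any newline tokens, then drop a trailing comma
            i = len(corrected) - 1
            while i >= 0 and corrected[i] == '\n':
                i -= 1
            if i >= 0 and corrected[i] == ',':
                del corrected[i]
            corrected.append(tok_string)
        else:
            corrected.append(tok_string)
    return json.loads(''.join(corrected))
```

With this variant, input such as {"a": 1; "b": [1; 2;];} parses even though it has separators before the closing brackets.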

Demo:

>>> import tokenize
>>> import json
>>> open('semi.json', 'w').write('''\
... {
...   "client" : "someone";
...   "server" : ["s1"; "s2"];
...   "timestamp" : 1000000;
...   "content" : "hello; world"
... }
... ''')
109
>>> corrected = []
>>> with open('semi.json', 'r') as semi:
...     for token in tokenize.generate_tokens(semi.readline):
...         if token[0] == tokenize.OP and token[1] == ';':
...             corrected.append(',')
...         else:
...             corrected.append(token[1])
...
>>> print(''.join(corrected))
{
"client":"someone",
"server":["s1","s2"],
"timestamp":1000000,
"content":"hello; world"
}
>>> json.loads(''.join(corrected))
{'client': 'someone', 'server': ['s1', 's2'], 'timestamp': 1000000, 'content': 'hello; world'}

Inter-token whitespace was dropped, but it could be reinstated by paying attention to the tokenize.NL tokens and the (lineno, col) start and end position tuples that are part of each token. Since the whitespace around the tokens doesn't matter to a JSON parser, I've not bothered with this.
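If the layout did matter, tokenize.untokenize can rebuild it from those position tuples. A sketch (convert_keep_whitespace is a hypothetical name), relying on the fact that ';' and ',' are the same width so the recorded columns stay valid:

```python
import tokenize
from io import StringIO

def convert_keep_whitespace(text):
    """Swap top-level semicolons for commas, preserving the original
    inter-token whitespace via each token's position tuple."""
    tokens = []
    for tok in tokenize.generate_tokens(StringIO(text).readline):
        if tok.type == tokenize.OP and tok.string == ';':
            # ',' is the same length as ';', so the stored positions stay valid
            tok = tok._replace(string=',')
        tokens.append(tok)
    # Full 5-tuples let untokenize reconstruct the spacing between tokens
    return tokenize.untokenize(tokens)
```

Semicolons inside string tokens are untouched, since the whole string arrives as one STRING token.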

Martijn Pieters answered Dec 26 '22 16:12