 

Parse non-standard semicolon separated "JSON"

I have a non-standard "JSON" file to parse. Each item is semicolon separated instead of comma separated. I can't simply replace ; with , because some values may themselves contain ;, e.g. "hello; world". How can I parse this into the same structure that JSON would normally produce?

{
  "client" : "someone";
  "server" : ["s1"; "s2"];
  "timestamp" : 1000000;
  "content" : "hello; world";
  ...
}
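To illustrate why a blanket text replacement is unsafe here, a quick sketch using the "content" key from the sample above:

```python
import json

raw = '{ "content" : "hello; world" }'

# A blanket replace also rewrites the semicolon *inside* the string literal
naive = raw.replace(';', ',')
print(naive)             # { "content" : "hello, world" }
print(json.loads(naive)) # parses fine, but the data is silently corrupted
```

The replacement has to distinguish semicolons that act as separators from semicolons that are part of string values.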
morefree asked Dec 24 '22 02:12

1 Answer

Use the Python tokenize module to transform the text stream into one with commas instead of semicolons. The Python tokenizer is happy to handle JSON input, semicolons included. It presents strings as whole tokens, so any 'raw' semicolons appear in the stream as single tokenize.OP tokens for you to replace:

import tokenize
import json

corrected = []

with open('semi.json', 'r') as semi:
    for token in tokenize.generate_tokens(semi.readline):
        if token[0] == tokenize.OP and token[1] == ';':
            corrected.append(',')
        else:
            corrected.append(token[1])

data = json.loads(''.join(corrected))

This assumes that the format becomes valid JSON once you've replaced the semicolons with commas; e.g. no trailing commas before a closing ] or } are allowed. That said, you could track the last comma appended and remove it again if the next non-newline token is a closing bracket or brace.
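That extra bookkeeping could look something like this sketch (fix_semi_json is a hypothetical helper name; it reads from a string via io.StringIO rather than a file):

```python
import json
import tokenize
from io import StringIO

def fix_semi_json(text):
    """Convert semicolon-separated pseudo-JSON to real JSON, dropping
    any comma left dangling before a closing ] or }."""
    corrected = []
    for token in tokenize.generate_tokens(StringIO(text).readline):
        tok_type, tok_string = token[0], token[1]
        if tok_type == tokenize.OP and tok_string == ';':
            corrected.append(',')
        elif tok_type == tokenize.OP and tok_string in (']', '}'):
            # Walk back over any newline tokens, then drop a trailing comma
            i = len(corrected) - 1
            while i >= 0 and corrected[i] == '\n':
                i -= 1
            if i >= 0 and corrected[i] == ',':
                del corrected[i]
            corrected.append(tok_string)
        else:
            corrected.append(tok_string)
    return json.loads(''.join(corrected))
```

With this variant, input such as {"a": 1; "b": [1; 2;];} parses even though it has separators before the closing brackets.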

Demo:

>>> import tokenize
>>> import json
>>> open('semi.json', 'w').write('''\
... {
...   "client" : "someone";
...   "server" : ["s1"; "s2"];
...   "timestamp" : 1000000;
...   "content" : "hello; world"
... }
... ''')
109
>>> corrected = []
>>> with open('semi.json', 'r') as semi:
...     for token in tokenize.generate_tokens(semi.readline):
...         if token[0] == tokenize.OP and token[1] == ';':
...             corrected.append(',')
...         else:
...             corrected.append(token[1])
...
>>> print(''.join(corrected))
{
"client":"someone",
"server":["s1","s2"],
"timestamp":1000000,
"content":"hello; world"
}
>>> json.loads(''.join(corrected))
{'client': 'someone', 'server': ['s1', 's2'], 'timestamp': 1000000, 'content': 'hello; world'}

Inter-token whitespace was dropped, but it could be reinstated by paying attention to the tokenize.NL tokens and the (lineno, col) start and end position tuples that are part of each token. Since the whitespace around the tokens doesn't matter to a JSON parser, I've not bothered with this.
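If the layout did matter, tokenize.untokenize can rebuild it from those position tuples. A sketch (convert_keep_whitespace is a hypothetical name), relying on the fact that ';' and ',' are the same width so the recorded columns stay valid:

```python
import tokenize
from io import StringIO

def convert_keep_whitespace(text):
    """Swap top-level semicolons for commas, preserving the original
    inter-token whitespace via each token's position tuple."""
    tokens = []
    for tok in tokenize.generate_tokens(StringIO(text).readline):
        if tok.type == tokenize.OP and tok.string == ';':
            # ',' is the same length as ';', so the stored positions stay valid
            tok = tok._replace(string=',')
        tokens.append(tok)
    # Full 5-tuples let untokenize reconstruct the spacing between tokens
    return tokenize.untokenize(tokens)
```

Semicolons inside string tokens are untouched, since the whole string arrives as one STRING token.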

Martijn Pieters answered Dec 26 '22 16:12