I have a non-standard "JSON" file to parse. Each item is semicolon separated instead of comma separated. I can't simply replace every ";" with "," because some value might itself contain a semicolon, e.g. "hello; world". How can I parse this into the same structure that JSON would normally produce?
{
    "client" : "someone";
    "server" : ["s1"; "s2"];
    "timestamp" : 1000000;
    "content" : "hello; world";
    ...
}
Use the Python tokenize module to transform the text stream into one with commas instead of semicolons. The Python tokenizer is happy to handle JSON input too, even including semicolons. The tokenizer presents strings as whole tokens, and 'raw' semicolons appear in the stream as single tokenize.OP tokens for you to replace:
import tokenize
import json

corrected = []
with open('semi.json', 'r') as semi:
    for token in tokenize.generate_tokens(semi.readline):
        if token[0] == tokenize.OP and token[1] == ';':
            corrected.append(',')       # swap the separator
        else:
            corrected.append(token[1])  # pass everything else through

data = json.loads(''.join(corrected))
This assumes that the format becomes valid JSON once you've replaced the semicolons with commas; e.g. no trailing commas before a closing ] or } are allowed. You could, however, track the last comma added and remove it again if the next non-newline token is a closing brace, as sketched below.
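For instance, here is a minimal sketch of that trailing-comma cleanup (the buffering approach and the loads_semi name are my own illustration, not part of the answer above): each comma is held back until the next significant token arrives, and dropped if that token closes a container.

import io
import json
import tokenize

def loads_semi(text):
    # Illustration only: hold each comma back until the next significant
    # token; drop it if that token closes a container, so "...; }" parses.
    corrected = []
    pending = False
    for token in tokenize.generate_tokens(io.StringIO(text).readline):
        if token[0] == tokenize.OP and token[1] == ';':
            pending = True                      # don't emit the comma yet
        elif pending and token[0] in (tokenize.NL, tokenize.NEWLINE):
            corrected.append(token[1])          # keep newlines; comma still pending
        else:
            if pending:
                if not (token[0] == tokenize.OP and token[1] in ')]}'):
                    corrected.append(',')       # the comma was safe after all
                pending = False
            corrected.append(token[1])
    return json.loads(''.join(corrected))

With this, loads_semi('{"server" : ["s1"; "s2";];}') returns {'server': ['s1', 's2']} even though the input has separators before the closing brackets.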
Demo:
>>> import tokenize
>>> import json
>>> with open('semi.json', 'w') as semi:
...     _ = semi.write('''\
... {
...     "client" : "someone";
...     "server" : ["s1"; "s2"];
...     "timestamp" : 1000000;
...     "content" : "hello; world"
... }
... ''')
...
>>> corrected = []
>>> with open('semi.json', 'r') as semi:
...     for token in tokenize.generate_tokens(semi.readline):
...         if token[0] == tokenize.OP and token[1] == ';':
...             corrected.append(',')
...         else:
...             corrected.append(token[1])
...
>>> print(''.join(corrected))
{
"client":"someone",
"server":["s1","s2"],
"timestamp":1000000,
"content":"hello; world"
}
>>> json.loads(''.join(corrected))
{'client': 'someone', 'server': ['s1', 's2'], 'timestamp': 1000000, 'content': 'hello; world'}
Inter-token whitespace was dropped, but could be reinstated by paying attention to the tokenize.NL tokens and the start and end (row, column) position tuples that are part of each token. Since the whitespace around the tokens doesn't matter to a JSON parser, I've not bothered with this.
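If you did want to preserve the layout, here is a minimal sketch (again my own illustration, assuming Python 3's tokenize module): swap the token strings but keep each token's positions, and let tokenize.untokenize() rebuild the stream from the full token tuples. Since ';' and ',' are the same width, the recorded positions stay consistent.

import io
import tokenize

def fix_semicolons(text):
    # ';' and ',' are the same width, so replacing the token string while
    # keeping each token's start/end positions lets untokenize() restore
    # the original inter-token whitespace.
    fixed = []
    for tok in tokenize.generate_tokens(io.StringIO(text).readline):
        if tok.type == tokenize.OP and tok.string == ';':
            tok = tok._replace(string=',')
        fixed.append(tok)
    return tokenize.untokenize(fixed)

json.loads(fix_semicolons(text)) then parses the document while fix_semicolons() keeps its original formatting intact.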