I am trying to decode a large UTF-8 JSON file (2.2 GB). I load the file like so:
import codecs
f = codecs.open('output.json', encoding='utf-8')
data = f.read()
If I try any of json.load, json.loads, or json.JSONDecoder().decode, I get the error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-40-fc2255017b19> in <module>()
----> 1 j = jd.decode(data)
/usr/lib/python2.7/json/decoder.pyc in decode(self, s, _w)
367 end = _w(s, end).end()
368 if end != len(s):
--> 369 raise ValueError(errmsg("Extra data", s, end, len(s)))
370 return obj
371
ValueError: Extra data: line 1 column -2065998994 - line 1 column 2228968302 (char -2065998994 - 2228968302)
uname -m shows x86_64, and
> python -c 'import sys;print("%x" % sys.maxsize, sys.maxsize > 2**32)'
('7fffffffffffffff', True)
so I should be on 64-bit, and the integer size shouldn't be a problem.
However, if I run:
jd = json.JSONDecoder()
len(data) # 2228968302
j = jd.raw_decode(data)
j[1] # 2228968302
The second value in the tuple returned by raw_decode is the end of the string, so raw_decode seems to parse the entire file with no garbage at the end.
So, is there something I should be doing differently with the JSON? Is raw_decode actually decoding the entire file? Why is json.load(s) failing?
I'd add this as a comment, but the formatting capabilities in comments are too limited.
Staring at the source code,
raise ValueError(errmsg("Extra data", s, end, len(s)))
calls this function:
def errmsg(msg, doc, pos, end=None):
...
fmt = '{0}: line {1} column {2} - line {3} column {4} (char {5} - {6})'
return fmt.format(msg, lineno, colno, endlineno, endcolno, pos, end)
The (char {5} - {6}) part of the format is this part of the error message you showed:
(char -2065998994 - 2228968302)
So, in errmsg(), pos is -2065998994 and end is 2228968302. Behold! ;-):
>>> pos = -2065998994
>>> end = 2228968302
>>> 2**32 + pos
2228968302L
>>> 2**32 + pos == end
True
That is, pos and end are "really" the same. Back from where errmsg() was called, that means end and len(s) are really the same too - but end is being viewed as a 32-bit signed integer. end in turn comes from a regular expression match object's end() method.
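To make the wraparound concrete, here is a small sketch of what "viewed as a 32-bit signed integer" does to the true offset. The helper name as_int32 is made up for illustration; it is not part of the json or re modules, and it only mimics the arithmetic, not the engine's actual code.

```python
def as_int32(n):
    """Reinterpret the low 32 bits of n as a signed 32-bit integer.

    Hypothetical helper for illustration only.
    """
    n &= 0xFFFFFFFF                      # keep the low 32 bits
    return n - 2**32 if n >= 2**31 else n


true_end = 2228968302                    # len(s) from the question
print(as_int32(true_end))                # -2065998994, the "pos" in the error
print(as_int32(true_end) == true_end)    # False, which is why the check fails
```

Any offset at or above 2**31 (2147483648) comes out negative this way, so a string shorter than 2 GiB characters would not trip the comparison.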
So the real problem here appears to be a 32-bit limitation/assumption in the regexp engine. I encourage you to open a bug report!
Later: to answer your questions, yes, raw_decode() is decoding the entire file. The other methods call raw_decode(), but add the (failing!) sanity checks afterwards.
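Until the underlying bug is fixed, one possible workaround is to call raw_decode() yourself and do the "extra data" check with plain string operations. This is only a sketch: decode_with_check is a name I made up, and it assumes the end offset returned by raw_decode() is correct, as it was in your session.

```python
import json

JSON_WHITESPACE = " \t\n\r"


def decode_with_check(text):
    """Decode one JSON document, rejecting trailing non-whitespace.

    Hypothetical helper: it relies on raw_decode() and does its own
    whitespace handling instead of using the decoder's regex.
    """
    # raw_decode() does not skip leading whitespace, so do that by hand
    # (an index loop avoids copying a multi-gigabyte string).
    start = 0
    while start < len(text) and text[start] in JSON_WHITESPACE:
        start += 1

    obj, end = json.JSONDecoder().raw_decode(text, start)

    # Anything other than whitespace after the document is an error.
    if text[end:].strip(JSON_WHITESPACE):
        raise ValueError("Extra data starting at char %d" % end)
    return obj
```

Usage would be j = decode_with_check(data) in place of jd.decode(data). I have not tried this on a string longer than 2**31 characters, so treat it as a starting point.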