
Yajl parse error with githubarchive.org JSON stream in Python

Tags: python, json, yajl

I'm trying to parse a GitHub archive file with yajl-py. I believe the basic format of the file is a stream of JSON objects, so the file itself is not valid JSON, but it contains objects which are.

To test this out, I installed yajl-py and then used their example parser (from https://github.com/pykler/yajl-py/blob/master/examples/yajl_py_example.py) to try to parse a file:

python yajl_py_example.py < 2012-03-12-0.json

where 2012-03-12-0.json is one of the GitHub archive files that's been decompressed.

Judging by the GitHub Archive reference implementation in Ruby, this sort of thing should work. Do the Python packages not handle JSON streams?

By the way, here's the error I get:

yajl.yajl_common.YajlError: parse error: trailing garbage
          9478bbc3","type":"PushEvent"}{"repository":{"url":"https://g
                     (right here) ------^
Bialecki asked May 03 '12 13:05


3 Answers

You need to use a stream parser to read the data. Yajl supports stream parsing, which lets you read one object at a time from a file or stream. Having said that, it doesn't look like Python has working bindings for Yajl.

py-yajl has iterload commented out, not sure why: https://github.com/rtyler/py-yajl/commit/a618f66005e9798af848c15d9aa35c60331e6687#L1R264

Not a Python solution, but you can use Ruby bindings to read in the data and emit it in a format you need:

# gem install yajl-ruby

require 'open-uri'
require 'zlib'
require 'yajl'

# URI.open is needed on Ruby 3.0+, where open-uri no longer patches Kernel#open
gz = URI.open('http://data.githubarchive.org/2012-03-11-12.json.gz')
js = Zlib::GzipReader.new(gz).read

Yajl::Parser.parse(js) do |event|
  print event
end
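
For readers who want to stay in Python, here is a minimal sketch of the same idea using only the standard library. It is not Yajl: it swaps in json.JSONDecoder.raw_decode, which parses one value and reports where it stopped, so concatenated objects can be read one after another. The function names iter_json_objects and iter_archive_events are hypothetical, and like the Ruby snippet it reads the whole file into memory first.

```python
import gzip
import json


def iter_json_objects(text):
    """Yield each top-level JSON value found in a string of concatenated JSON."""
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while pos < end:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            break
        # raw_decode returns the parsed value and the index where it stopped.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


def iter_archive_events(path):
    """Hypothetical helper: read a local .json.gz archive file and yield its events."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        yield from iter_json_objects(f.read())
```

Calling iter_archive_events on a downloaded archive file (assumed here to be gzip-compressed, UTF-8 text) yields one dictionary per event. A malformed or truncated object raises json.JSONDecodeError rather than being skipped silently.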
igrigorik answered Oct 29 '22 13:10


The example does not enable any of Yajl's extra features. For what you are trying to do, you need to enable the allow_multiple_values flag on the parser. Here is the change to make to the basic example so that it parses your file:

--- a/examples/yajl_py_example.py
+++ b/examples/yajl_py_example.py
@@ -37,6 +37,7 @@ class ContentHandler(YajlContentHandler):

 def main(args):
     parser = YajlParser(ContentHandler())
+    parser.allow_multiple_values = True
     if args:
         for fn in args:
             f = open(fn)

Yajl-Py is a thin wrapper around yajl, so you can use all the features Yajl provides. Here are all the flags that yajl provides that you can enable:

yajl_allow_comments
yajl_dont_validate_strings
yajl_allow_trailing_garbage
yajl_allow_multiple_values
yajl_allow_partial_values

To turn these on in yajl-py you do the following:

parser = YajlParser(ContentHandler())
# enabling these features, note that to make it more pythonic, the prefix `yajl_` was removed
parser.allow_comments = True
parser.dont_validate_strings = True
parser.allow_trailing_garbage = True
parser.allow_multiple_values = True
parser.allow_partial_values = True
# then go ahead and parse
parser.parse()
Pykler answered Oct 29 '22 12:10


I know this has been answered, but I prefer the following approach, which does not use any third-party packages. For some reason the GitHub archive file puts all of its JSON objects on a single line, so you cannot assume one dictionary per line. The content looks like this:

{"json-key":"json-val", "sub-dict":{"sub-key":"sub-val"}}{"json-key2":"json-val2", "sub-dict2":{"sub-key2":"sub-val2"}}

I decided to create a function that fetches one dictionary at a time. It returns the JSON as a string.

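def read_next_dictionary(f):
    """Return the next top-level JSON object in f as a string ('' at EOF)."""
    depth = 0
    in_string = False  # braces inside string values must not be counted
    escaped = False    # previous character was a backslash inside a string
    chars = []
    while True:
        c = f.read(1)
        if not c:
            break  # EOF
        if depth == 0 and c != '{':
            continue  # skip whitespace or newlines between objects
        chars.append(c)
        if in_string:
            if escaped:
                escaped = False
            elif c == '\\':
                escaped = True
            elif c == '"':
                in_string = False
        elif c == '"':
            in_string = True
        elif c == '{':
            depth += 1
        elif c == '}':
            depth -= 1
            if depth == 0:
                break  # the top-level object is complete

    return ''.join(chars)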

I used this function to loop through the GitHub archive with a while loop:

import json
import pprint

arr_of_dicts = []
with open(file_path) as f:
    while True:
        json_as_str = read_next_dictionary(f)
        try:
            arr_of_dicts.append(json.loads(json_as_str))
        except ValueError:
            break  # empty or incomplete string: end of file reached

pprint.pprint(arr_of_dicts)

This works on the dataset posted here: http://www.githubarchive.org/ (after gunzip)

rirwin answered Oct 29 '22 12:10