
Python: Trying to Deserialize Multiple JSON objects in a file with each object spanning multiple but consistently spaced number of lines


OK, after nearly a week of research I'm going to give SO a shot. I have a text file that looks as follows (showing 3 separate json objects as an example, but the file has about 50K of these):

{
"zipcode":"00544",
"current":{"canwc":null,"cig":7000,"class":"observation"},
"triggers":[178,30,176,103,179,112,21,20,48,7,50,40,57]
}
{
"zipcode":"00601",
"current":{"canwc":null,"cig":null,"class":"observation"},
"triggers":[12,23,34,28,100]
}
{
"zipcode":"00602",
"current":{"canwc":null,"cig":null,"class":"observation"},
"triggers":[13,85,43,101,38,31]
}

I know how to work with JSON objects using the Python json library, but I'm having a challenge with how to create 50 thousand different json objects from reading the file. (Perhaps I'm not even thinking about this correctly, but ultimately I need to deserialize them and load them into a database.) I've tried itertools, thinking that I need a generator, and was able to use:

with open(file) as f:
    for line in itertools.islice(f, 0, 7): #since every 7 lines is a json object
        jfile = json.load(line)

But the above obviously won't work since it is not reading the 7 lines as a single json object, and I'm also not sure how to then iterate over the entire file and load the individual json objects.

The following would give me a list I can slice:

list(open(file))[:7]

Any help would be really appreciated.


Extremely close to what I need and I think literally one step away, but still struggling a little with iteration. This will finally get me an iterative printout of all of the dataframes, but how do I make it so that I can capture one giant dataframe with all of the pieces essentially concatenated? I could then export that final dataframe to csv etc. (Also, is there a better way to upload this result into a database rather than creating a giant dataframe first?)

import itertools
import json
from itertools import chain

import pandas as pd

def lines_per_n(f, n):
    for line in f:
        yield ''.join(chain([line], itertools.islice(f, n - 1)))

def flatten(jfile):
    for k, v in list(jfile.items()):
        if isinstance(v, list):
            # join list entries (e.g. the "triggers" ints) into a single string
            jfile[k] = ','.join(map(str, v))
        elif isinstance(v, dict):
            # promote nested keys (e.g. from "current") to the top level
            for kk, vv in v.items():
                jfile[kk] = vv
            del jfile[k]
    return jfile

with open('deadzips.json') as f:
    for chunk in lines_per_n(f, 7):
        try:
            jfile = json.loads(chunk)
            pd.DataFrame(flatten(jfile).items())
        except ValueError:
            pass
        else:
            pass
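
What I have in mind, roughly, is collecting each flattened dict into a list and only building the big dataframe at the end, something like the sketch below (it reuses lines_per_n and flatten from above; names like rows are just placeholders, and the commented-out to_sql line is only my guess at how to skip the csv step entirely):

import json

import pandas as pd

rows = []                                   # one flattened dict per json object
with open('deadzips.json') as f:
    for chunk in lines_per_n(f, 7):         # lines_per_n and flatten defined above
        try:
            rows.append(flatten(json.loads(chunk)))
        except ValueError:
            pass                            # skip chunks that don't parse

big_df = pd.DataFrame(rows)                 # all 50K objects in one frame
big_df.to_csv('deadzips.csv', index=False)

# or skip the csv step and push straight to a database, e.g.:
# big_df.to_sql('deadzips', engine, if_exists='replace')  # engine = a SQLAlchemy engine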
asked Dec 05 '13 by horatio1701d


2 Answers

Load 6 extra lines instead, and pass the string to json.loads():

import itertools
import json

with open(file) as f:
    for line in f:
        # slice the next 6 lines from the iterable, as a list.
        lines = [line] + list(itertools.islice(f, 6))
        jfile = json.loads(''.join(lines))

        # do something with jfile

json.load() will slurp up more than just the next object in the file, and islice(f, 0, 7) would read only the first 7 lines, rather than reading the file in 7-line blocks.

You can wrap reading a file in blocks of size N in a generator:

from itertools import islice, chain

def lines_per_n(f, n):
    for line in f:
        yield ''.join(chain([line], islice(f, n - 1)))

then use that to chunk up your input file:

with open(file) as f:
    for chunk in lines_per_n(f, 7):
        jfile = json.loads(chunk)

        # do something with jfile

Alternatively, if your blocks turn out to be of variable length, read until you have something that parses:

with open(file) as f:
    for line in f:
        while True:
            try:
                jfile = json.loads(line)
                break
            except ValueError:
                # Not yet a complete JSON value
                line += next(f)

        # do something with jfile
answered by Martijn Pieters


As stated elsewhere, a general solution is to read the file in pieces, append each piece to the last, and try to parse that new chunk. If it doesn't parse, continue until you get something that does. Once you have something that parses, return it, and restart the process. Rinse-lather-repeat until you run out of data.

Here is a succinct generator that will do this:

import json

def load_json_multiple(segments):
    chunk = ""
    for segment in segments:
        chunk += segment
        try:
            yield json.loads(chunk)
            chunk = ""          # parsed successfully, start a fresh chunk
        except ValueError:
            pass                # not a complete JSON value yet, keep accumulating

Use it like this:

with open('foo.json') as f:
    for parsed_json in load_json_multiple(f):
        print parsed_json
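
Since the end goal is a database, each parsed object can also be written out as soon as it is decoded, instead of collecting everything in memory first. Here is a rough sketch using sqlite3 together with the generator above; the database file name, table and column layout are made-up placeholders:

import sqlite3

conn = sqlite3.connect('zips.db')
conn.execute('CREATE TABLE IF NOT EXISTS zips (zipcode TEXT, cig INTEGER, triggers TEXT)')

with open('foo.json') as f:
    for obj in load_json_multiple(f):
        # store the triggers list as a comma-separated string for simplicity
        conn.execute('INSERT INTO zips VALUES (?, ?, ?)',
                     (obj['zipcode'],
                      obj['current']['cig'],
                      ','.join(str(t) for t in obj['triggers'])))

conn.commit()
conn.close()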

I hope this helps.

answered by Jeff Younker