Extract nested JSON embedded as string in Pandas dataframe

Tags: python, json, csv

I have a CSV where one of the fields is a nested JSON object, stored as a string. I would like to load the CSV into a dataframe and parse the JSON into a set of fields appended to the original dataframe; in other words, extract the contents of the JSON and make them part of the dataframe.

My CSV:

id|dist|json_request
1|67|{"loc":{"lat":45.7, "lon":38.9},"arrival": "Monday", "characteristics":{"body":{"color":"red", "make":"sedan"}, "manuf_year":2014}}
2|34|{"loc":{"lat":46.89, "lon":36.7},"arrival": "Tuesday", "characteristics":{"body":{"color":"blue", "make":"sedan"}, "manuf_year":2014}}
3|98|{"loc":{"lat":45.70, "lon":31.0}, "characteristics":{"body":{"color":"yellow"}, "manuf_year":2010}}

Note that not all keys are the same for all the rows. I'd like it to produce a data frame equivalent to this:

data = {'id'     : [1, 2, 3],
        'dist'  : [67, 34, 98],
        'loc_lat': [45.7, 46.89, 45.70],
        'loc_lon': [38.9, 36.7, 31.0],
        'arrival': ["Monday", "Tuesday", "NA"],
        'characteristics_body_color':["red", "blue", "yellow"],
        'characteristics_body_make':["sedan", "sedan", "NA"],
        'characteristics_manuf_year':[2014, 2014, 2010]}
df = pd.DataFrame(data)

(I'm really sorry, I can't get the table itself to look sensible in SO! Please don't be mad at me, I'm a rookie :( )

What I've tried

After a lot of futzing around, I came up with the following solution:

#Imports
import json

import pandas as pd
from pandas.io.json import json_normalize

#Import data
df_raw = pd.read_csv("sample.csv", delimiter="|")

#Parsing function
def parse_request(s):
    sj = json.loads(s)
    norm = json_normalize(sj)
    return norm

#Create an empty dataframe to store results
parsed = pd.DataFrame()

#Loop through and parse JSON in each row
for i in df_raw.json_request:
    parsed = parsed.append(parse_request(i), ignore_index=True)

#Merge results back onto original dataframe (rows align by position)
df_parsed = df_raw.join(parsed)

This is obviously inelegant and really inefficient (would take multiple hours on the 300K rows that I have to parse). Is there a better way?

Where I've looked

I've gone through the following related questions: Reading a CSV into pandas where one column is a json string (which seems to only work for simple, non-nested JSON)

JSON to pandas DataFrame (I borrowed parts of my solutions from this, but I can't figure out how to apply this solution across the dataframe without looping through rows)

I'm using Python 3.3 and Pandas 0.17.

asked Dec 04 '15 by SoHei

1 Answer

Here's an approach that speeds things up by a factor of 10 to 100, and should allow you to read your big file in under a minute, as opposed to over an hour. The idea is to only construct a dataframe once all of the data has been read, thereby reducing the number of times memory needs to be allocated, and to only call json_normalize once on the entire chunk of data, rather than on each row:

import csv
import json

import pandas as pd
from pandas.io.json import json_normalize

with open('sample.csv') as fh:
    rows = csv.reader(fh, delimiter='|')
    header = next(rows)

    # "transpose" the data. `data` is now a tuple of strings
    # containing JSON, one for each row
    idents, dists, data = zip(*rows)

data = [json.loads(row) for row in data]
df = json_normalize(data)
df['ids'] = idents
df['dists'] = dists

So that:

>>> print(df)

   arrival characteristics.body.color characteristics.body.make  \
0   Monday                        red                     sedan   
1  Tuesday                       blue                     sedan   
2      NaN                     yellow                       NaN   

   characteristics.manuf_year  loc.lat  loc.lon ids  
0                        2014    45.70     38.9   1  
1                        2014    46.89     36.7   2  
2                        2010    45.70     31.0   3
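A side note on dtypes: csv.reader returns every field as a string, so the ids and dists columns in the frame above come back as text. If you need them as numbers, a conversion along these lines works (a small extra step, not shown in the output above):

# csv.reader yields strings, so cast the id/dist columns back to numbers.
df['ids'] = pd.to_numeric(df['ids'])
df['dists'] = pd.to_numeric(df['dists'])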

Furthermore, I looked into what pandas's json_normalize is doing, and it's performing some deep copies that shouldn't be necessary if you're just creating a dataframe from a CSV. We can implement our own flatten function which takes a dictionary and "flattens" the keys, similar to what json_normalize does. Then we can make a generator which spits out one row of the dataframe at a time as a record. This approach is even faster:

def flatten(dct, separator='_'):
    """A fast way to flatten a dictionary,"""
    res = {}
    queue = [('', dct)]

    while queue:
        prefix, d = queue.pop()
        for k, v in d.items():
            key = prefix + k
            if not isinstance(v, dict):
                res[key] = v
            else:
                queue.append((key + separator, v))

    return res

def records_from_json(fh):
    """Yields the records from a file object."""
    rows = csv.reader(fh, delimiter='|')
    header = next(rows)
    for ident, dist, data in rows:
        rec = flatten(json.loads(data))
        rec['id'] = ident
        rec['dist'] = dist
        yield rec

def from_records(path):
    with open(path) as fh:
        return pd.DataFrame.from_records(records_from_json(fh))
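
To make the behaviour of flatten concrete, here is a quick check on one of the sample rows, plus a call to from_records (assuming the sample data from the question is saved as sample.csv):

import json

# One JSON string from the sample data, flattened with the helper above.
sample = ('{"loc":{"lat":45.7, "lon":38.9},"arrival": "Monday", '
          '"characteristics":{"body":{"color":"red", "make":"sedan"}, '
          '"manuf_year":2014}}')
print(flatten(json.loads(sample)))
# -> keys: loc_lat, loc_lon, arrival, characteristics_body_color,
#    characteristics_body_make, characteristics_manuf_year

df2 = from_records('sample.csv')
print(df2.columns.tolist())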

And here are the results of a timing experiment where I artificially increased the size of your sample data by repeating rows. The number of lines is denoted by n_rows:

        method 1 (s)  method 2 (s)  original time (s)
n_rows                                               
96          0.008217      0.002971           0.362257
192         0.014484      0.004720           0.678590
384         0.027308      0.008720           1.373918
768         0.055644      0.016175           2.791400
1536        0.105730      0.030914           5.727828
3072        0.209049      0.060105          11.877403

Extrapolating linearly, the first method should read 300k lines in about 20 seconds, while the second method should take around 6 seconds.
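
If you want to run a similar comparison on your own data, a minimal timing harness looks something like this (just a sketch, not the benchmark used above; sample_big.csv is a hypothetical file built by repeating the sample rows):

import timeit

# Average a few repetitions of method 2 (from_records defined above).
n_repeats = 5
t = timeit.timeit(lambda: from_records('sample_big.csv'), number=n_repeats)
print('method 2: {:.4f} s per read'.format(t / n_repeats))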

answered Nov 19 '22 by jme