Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas json_normalize produces confusing `KeyError` message?

I'm trying to convert a nested JSON to a Pandas dataframe. I've been using json_normalize with success until I came across a certain JSON. I've made a smaller version of it to recreate the problem.

from pandas.io.json import json_normalize

json=[{"events": [{"schedule": {"date": "2015-08-27",
     "location": {"building": "BDC", "floor": 5},
     "ID": 815},
    "group": "A"},
   {"schedule": {"date": "2015-08-27",
     "location": {"building": "BDC", "floor": 5},
 "ID": 816},
"group": "A"}]}]

I then run:

json_normalize(json[0],'events',[['schedule','date'],['schedule','location','building'],['schedule','location','floor']])

Expecting to see something like this:

ID      group   schedule.date   schedule.location.building schedule.location.floor  
'815'   'A'     '2015-08-27'            'BDC'                       5
'816'   'A'     '2015-08-27'            'BDC'                       5

But instead I get this error:

In [2]: json_normalize(json[0],'events',[['schedule','date'],['schedule','location','building'],['schedule','location','floor']])
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-2-b588a9e3ef1d> in <module>()
----> 1 json_normalize(json[0],'events',[['schedule','date'],['schedule','location','building'],['schedule','location','floor']])

/Users/logan/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/pandas/io/json.pyc in json_normalize(data, record_path, meta, meta_prefix, record_prefix)
    739                 records.extend(recs)
    740
--> 741     _recursive_extract(data, record_path, {}, level=0)
    742
    743     result = DataFrame(records)

/Users/logan/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/pandas/io/json.pyc in _recursive_extract(data, path, seen_meta, level)
    734                         meta_val = seen_meta[key]
    735                     else:
--> 736                         meta_val = _pull_field(obj, val[level:])
    737                     meta_vals[key].append(meta_val)
    738

/Users/logan/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/pandas/io/json.pyc in _pull_field(js, spec)
    674         if isinstance(spec, list):
    675             for field in spec:
--> 676                 result = result[field]
    677         else:
    678             result = result[spec]

KeyError: 'schedule'
like image 420
themachinist Avatar asked Aug 29 '15 22:08

themachinist


3 Answers

In this case, I think you'd just use this:

In [57]: json_normalize(data[0]['events'])
Out[57]: 
  group  schedule.ID schedule.date schedule.location.building  \
0     A          815    2015-08-27                        BDC   
1     A          816    2015-08-27                        BDC   

   schedule.location.floor  
0                        5  
1                        5  

The meta paths ([['schedule','date']...]) are for specifying data at the same level of nesting as your records, i.e. at the same level as 'events'. It doesn't look like json_normalize handles dicts with nested lists particularly well, so you may need to do some manual reshaping if your actual data is much more complicated.

like image 104
chrisb Avatar answered Nov 20 '22 09:11

chrisb


I got the KeyError when the structue of the json was not consistent. Meaning, when one of the nested strucutes were missing from the json, I got KeyError.

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.json.json_normalize.html

From the examples mentioned on the pandas documentation site, if you make the nested tag (counties) missing on one of the records, you will get a KeyError. To circumvent this, you might have to make sure ignore the missing tag or consider only the records which have nested column/tag populated with data.

like image 3
Parachute Avatar answered Nov 20 '22 10:11

Parachute


I had this same problem! This thread helped, especially parachute py's answer.

I found a solution using:

df.dropna(subset = *column(s) with nested data*)

then saving the resultant df as a new json. Load the new json and now you'll be able to flatten the nested columns.

There's probably a more efficient way to get around this, but my solution works.

edit: forgot to mention, I tried using the *errors = 'ignore'* arg in json.normalize() and it didn't help.

like image 1
Jim Arnold Avatar answered Nov 20 '22 11:11

Jim Arnold