Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

pandas slicing nested values along with other columns

I am trying to get the nested values from this below json data.

{
    "region_id": 60763,
    "phone": "",
    "address": {
        "region": "NY",
        "street-address": "147 West 43rd Street",
        "postal-code": "10036",
        "locality": "New York City"
    },
    "id": 113317,
    "name": "Casablanca Hotel Times Square"
}
{
    "region_id": 32655,
    "phone": "",
    "address": {
        "region": "CA",
        "street-address": "300 S Doheny Dr",
        "postal-code": "90048",
        "locality": "Los Angeles"
    },
    "id": 76049,
    "name": "Four Seasons Hotel Los Angeles at Beverly Hills"
}

I just loaded the above data in to my pandas data frame using:

with open("file path") as f:
    df = pd.DataFrame(json.loads(line) for line in f)

Now my data frame looks like this:

   address                                              Phone
0  {u'region': u'NY', u'street-address': u'147 We...      
1  {u'region': u'CA', u'street-address': u'300 S ...   

       id                                             name  region_id
0  113317                    Casablanca Hotel Times Square  60763   
1   76049  Four Seasons Hotel Los Angeles at Beverly Hills  32655   

I could get the column subset using this - data = df[['id', 'name']]

But not sure how I can get the values of region and street-address along with id and name. My output data frame should have id, name, region, street-address.

Note : I tried to pop out and concatenate this nested column address with my data frame. But since my data is huge - 348MB, concatenate takes huge memory when I try column-wise - (axis - 1).

Also I am looking for an efficient way to handle this, should I go with Numpy which will directly use the C extension. Or writing in to some database like MongoDB. I am considering this because after subsetting this data, I need to join this other dataset based on the id column to get few other fields.

like image 234
ds_user Avatar asked Jan 07 '23 15:01

ds_user


1 Answers

The following would work (however, I added a more efficient solution below; just scroll down to the EDIT):

import pandas as pd

# read the updated json file
df = pd.read_json('data.json')

# convert column with the nested json structure
tempdf = pd.concat([pd.DataFrame.from_dict(item, orient='index').T for item in df.address])

# get rid of the converted column
df.drop('address', 1, inplace=True)

# prepare concat
tempdf.index = df.index

# merge the two dataframes back together
df = pd.concat([df, tempdf], axis=1)

Output:

       id                                             name phone  region_id  \
0  113317                    Casablanca Hotel Times Square            60763   
1   76049  Four Seasons Hotel Los Angeles at Beverly Hills            32655   

  region        street-address postal-code       locality  
0     NY  147 West 43rd Street       10036  New York City  
1     CA       300 S Doheny Dr       90048    Los Angeles 

Now you can get rid of unwanted columns using the drop command.

I modified your json file which was actually not valid; you can check it e.g. on JSONLint:

[{
    "region_id": 60763,
    "phone": "",
    "address": {
        "region": "NY",
        "street-address": "147 West 43rd Street",
        "postal-code": "10036",
        "locality": "New York City"
    },
    "id": 113317,
    "name": "Casablanca Hotel Times Square"
}, {
    "region_id": 32655,
    "phone": "",
    "address": {
        "region": "CA",
        "street-address": "300 S Doheny Dr",
        "postal-code": "90048",
        "locality": "Los Angeles"
    },
    "id": 76049,
    "name": "Four Seasons Hotel Los Angeles at Beverly Hills"
}]

EDIT

Building up on @MaxU's answer (which did not work for me), you could also do the following:

import pandas as pd
import ujson
from pandas.io.json import json_normalize

# this is the json file from above
with open('data.json') as f:
    data = ujson.load(f)

Now, as proposed by @MaxU, you can use json_normalize to get rid of the nested structure:

df3 = json_normalize(data)

This gives you:

  address.locality address.postal-code address.region address.street-address      id                                             name phone  region_id
0    New York City               10036             NY   147 West 43rd Street  113317                    Casablanca Hotel Times Square            60763
1      Los Angeles               90048             CA        300 S Doheny Dr   76049  Four Seasons Hotel Los Angeles at Beverly Hills            32655

You can rename the columns you want to keep like this:

df3.rename(columns={'address.region': 'region', 'address.street-address': 'street-address'}, inplace=True)

and then choose the columns you would like to keep:

df3 = df3[['id', 'name', 'region', 'street-address']]

which gives you the desired output:

       id                                             name region        street-address
0  113317                    Casablanca Hotel Times Square     NY  147 West 43rd Street
1   76049  Four Seasons Hotel Los Angeles at Beverly Hills     CA       300 S Doheny Dr
like image 100
Cleb Avatar answered Jan 13 '23 09:01

Cleb