I am trying to get the nested values from this below json data.
{
"region_id": 60763,
"phone": "",
"address": {
"region": "NY",
"street-address": "147 West 43rd Street",
"postal-code": "10036",
"locality": "New York City"
},
"id": 113317,
"name": "Casablanca Hotel Times Square"
}
{
"region_id": 32655,
"phone": "",
"address": {
"region": "CA",
"street-address": "300 S Doheny Dr",
"postal-code": "90048",
"locality": "Los Angeles"
},
"id": 76049,
"name": "Four Seasons Hotel Los Angeles at Beverly Hills"
}
I just loaded the above data in to my pandas data frame using:
with open("file path") as f:
df = pd.DataFrame(json.loads(line) for line in f)
Now my data frame looks like this:
address Phone
0 {u'region': u'NY', u'street-address': u'147 We...
1 {u'region': u'CA', u'street-address': u'300 S ...
id name region_id
0 113317 Casablanca Hotel Times Square 60763
1 76049 Four Seasons Hotel Los Angeles at Beverly Hills 32655
I could get the column subset using this - data = df[['id', 'name']]
But not sure how I can get the values of region
and street-address
along with id
and name
. My output data frame should have id, name, region, street-address
.
Note : I tried to pop out and concatenate this nested column address
with my data frame. But since my data is huge - 348MB, concatenate takes huge memory when I try column-wise - (axis - 1).
Also I am looking for an efficient way to handle this, should I go with Numpy which will directly use the C extension. Or writing in to some database like MongoDB. I am considering this because after subsetting this data, I need to join this other dataset based on the id column to get few other fields.
The following would work (however, I added a more efficient solution below; just scroll down to the EDIT):
import pandas as pd
# read the updated json file
df = pd.read_json('data.json')
# convert column with the nested json structure
tempdf = pd.concat([pd.DataFrame.from_dict(item, orient='index').T for item in df.address])
# get rid of the converted column
df.drop('address', 1, inplace=True)
# prepare concat
tempdf.index = df.index
# merge the two dataframes back together
df = pd.concat([df, tempdf], axis=1)
Output:
id name phone region_id \
0 113317 Casablanca Hotel Times Square 60763
1 76049 Four Seasons Hotel Los Angeles at Beverly Hills 32655
region street-address postal-code locality
0 NY 147 West 43rd Street 10036 New York City
1 CA 300 S Doheny Dr 90048 Los Angeles
Now you can get rid of unwanted columns using the drop
command.
I modified your json file which was actually not valid; you can check it e.g. on JSONLint:
[{
"region_id": 60763,
"phone": "",
"address": {
"region": "NY",
"street-address": "147 West 43rd Street",
"postal-code": "10036",
"locality": "New York City"
},
"id": 113317,
"name": "Casablanca Hotel Times Square"
}, {
"region_id": 32655,
"phone": "",
"address": {
"region": "CA",
"street-address": "300 S Doheny Dr",
"postal-code": "90048",
"locality": "Los Angeles"
},
"id": 76049,
"name": "Four Seasons Hotel Los Angeles at Beverly Hills"
}]
EDIT
Building up on @MaxU's answer (which did not work for me), you could also do the following:
import pandas as pd
import ujson
from pandas.io.json import json_normalize
# this is the json file from above
with open('data.json') as f:
data = ujson.load(f)
Now, as proposed by @MaxU, you can use json_normalize to get rid of the nested structure:
df3 = json_normalize(data)
This gives you:
address.locality address.postal-code address.region address.street-address id name phone region_id
0 New York City 10036 NY 147 West 43rd Street 113317 Casablanca Hotel Times Square 60763
1 Los Angeles 90048 CA 300 S Doheny Dr 76049 Four Seasons Hotel Los Angeles at Beverly Hills 32655
You can rename the columns you want to keep like this:
df3.rename(columns={'address.region': 'region', 'address.street-address': 'street-address'}, inplace=True)
and then choose the columns you would like to keep:
df3 = df3[['id', 'name', 'region', 'street-address']]
which gives you the desired output:
id name region street-address
0 113317 Casablanca Hotel Times Square NY 147 West 43rd Street
1 76049 Four Seasons Hotel Los Angeles at Beverly Hills CA 300 S Doheny Dr
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With