I have now added the current problem to GitHub; the URL for the repo is below. I have included a Jupyter notebook that also explains the problem. Thanks, guys.
https://github.com/simongraham/dataExplore.git
I am currently working with nutritional data for a project. The data is in raw JSON format, and I want to use Python and pandas to turn it into an understandable data frame. I understand that this is an easy task when the JSON is not nested. In that case I would use:
nutrition = pd.read_json('data')
However, I have nested information and I am finding it very difficult to get it into a reasonable data frame. The JSON format is as follows, where the nutritionNutrients element is itself nested. It describes the nutritional content for a variety of nutrients, such as alcohol and bcfa, as shown. I have included only a sample because the data file is large.
[
{
"vcNutritionPortionId": "478d1905-f264-4d9b-ab76-0ed4252193fd",
"vcNutritionId": "2476378b-79ee-4857-a81d-489661a039a1",
"vcUserId": "cc51145b-5a70-4344-9b55-1a4455f0a9d2",
"vcPortionId": "1",
"vcPortionName": "1 average pepper",
"vcPortionSize": "20",
"ftEnergyKcal": 5.2,
"vcPortionUnit": "g",
"dtConsumedDate": "2016-05-04T00:00:00",
"nutritionNutrients": [
{
"vcNutritionPortionId": "478d1905-f264-4d9b-ab76-0ed4252193fd",
"vcNutrient": "alcohol",
"ftValue": 0,
"vcUnit": "g",
"nPercentRI": 0,
"vcTrafficLight": ""
},
{
"vcNutritionPortionId": "478d1905-f264-4d9b-ab76-0ed4252193fd",
"vcNutrient": "bcfa",
"ftValue": 0,
"vcUnit": "g",
"nPercentRI": 0,
"vcTrafficLight": ""
},
{
"vcNutritionPortionId": "478d1905-f264-4d9b-ab76-0ed4252193fd",
"vcNutrient": "biotin",
"ftValue": 0,
"vcUnit": "µg",
"nPercentRI": 0,
"vcTrafficLight": ""
},
...
]
}
]
Any help would be appreciated.
Thanks.
.... ....
Now that I have found out how to solve this problem using json_normalize, I am back with the same problem, but this time my JSON is nested twice, i.e.:
[
{
...,
"nutritionPortions": [
{
"vcNutritionPortionId": "478d1905-f264-4d9b-ab76-0ed4252193fd",
"vcNutritionId": "2476378b-79ee-4857-a81d-489661a039a1",
"vcUserId": "cc51145b-5a70-4344-9b55-1a4455f0a9d2",
"vcPortionId": "1",
"vcPortionName": "1 average pepper",
"vcPortionSize": "20",
"ftEnergyKcal": 5.2,
"vcPortionUnit": "g",
"dtConsumedDate": "2016-05-04T00:00:00",
"nutritionNutrients": [
{
"vcNutritionPortionId": "478d1905-f264-4d9b-ab76-0ed4252193fd",
"vcNutrient": "alcohol",
"ftValue": 0,
"vcUnit": "g",
"nPercentRI": 0,
"vcTrafficLight": ""
},
{
"vcNutritionPortionId": "478d1905-f264-4d9b-ab76-0ed4252193fd",
"vcNutrient": "bcfa",
"ftValue": 0,
"vcUnit": "g",
"nPercentRI": 0,
"vcTrafficLight": ""
},
{
"vcNutritionPortionId": "478d1905-f264-4d9b-ab76-0ed4252193fd",
"vcNutrient": "biotin",
"ftValue": 0,
"vcUnit": "µg",
"nPercentRI": 0,
"vcTrafficLight": ""
},
...
]
}
]
}
]
When I have a JSON consisting of only nutrition data I can use:
nutrition = (pd.io
.json
.json_normalize((data, ['nutritionPortions']), 'nutritionNutrients',
['vcNutritionId','vcUserId','vcPortionId','vcPortionName','vcPortionSize',
'ftEnergyKcal','vcPortionUnit','dtConsumedDate'])
)
However, my data does not contain only nutrition information. For example, it also contains activity information, so the nutrition information is nested under "nutritionPortions" at the start. Let's assume that all other columns are not nested and that they are represented by "Activity" and "Wellbeing".
If I use the code:
nutrition = (pd.io
.json
.json_normalize(data, ['nutritionPortions'])
)
this brings me back to the original problem, where "nutritionNutrients" is still nested, and I have had no success obtaining the corresponding data frame from there.
Thanks
UPDATE: this should work for your kaidoData.json file:
df = (pd.io
.json
.json_normalize(data[0]['nutritionPortions'], 'nutritionNutrients',
['vcNutritionId','vcUserId','vcPortionId','vcPortionName','vcPortionSize',
'dtCreatedDate','dtUpdatedDate','nProcessingStatus',
'vcPortionUnit','dtConsumedDate'
]
)
)
PS: I don't know what is wrong with 'ftEnergyKcal'. Including it in the list raises:
KeyError: 'ftEnergyKcal'
Maybe it's missing in some sections.
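If the KeyError does come from portions that lack 'ftEnergyKcal', one option is the errors='ignore' argument of json_normalize, which fills missing meta keys with NaN instead of raising. Below is a minimal sketch under that assumption. The second portion is made up to reproduce the situation, and it uses pd.json_normalize, the top-level name available in pandas 1.0 and later.

```python
import pandas as pd

# Hypothetical portions: the second one has no 'ftEnergyKcal' key,
# which is the assumed cause of the KeyError.
portions = [
    {
        "vcPortionName": "1 average pepper",
        "ftEnergyKcal": 5.2,
        "nutritionNutrients": [{"vcNutrient": "alcohol", "ftValue": 0}],
    },
    {
        "vcPortionName": "hypothetical portion without energy",
        "nutritionNutrients": [{"vcNutrient": "bcfa", "ftValue": 0}],
    },
]

# errors='ignore' puts NaN in a meta column when the key is absent
df = pd.json_normalize(
    portions,
    record_path="nutritionNutrients",
    meta=["vcPortionName", "ftEnergyKcal"],
    errors="ignore",
)
print(df)
```

Rows that come from a portion without the key then carry NaN in the ftEnergyKcal column, so they are easy to find afterwards with df['ftEnergyKcal'].isna().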
OLD answer:
Use json_normalize():
(pd.io
.json
.json_normalize(l, 'nutritionNutrients',
['vcNutritionId','vcUserId','vcPortionId','vcPortionName','vcPortionSize',
'ftEnergyKcal','vcPortionUnit','dtConsumedDate'])
)
demo:
In [107]: (pd.io
.....: .json
.....: .json_normalize(l, 'nutritionNutrients',
.....: ['vcNutritionId','vcUserId','vcPortionId','vcPortionName','vcPortionSize',
.....: 'ftEnergyKcal','vcPortionUnit','dtConsumedDate'])
.....: )
Out[107]:
ftValue nPercentRI vcNutrient vcNutritionPortionId vcTrafficLight ... vcPortionSize \
0 0 0 alcohol 478d1905-f264-4d... ... 20
1 0 0 bcfa 478d1905-f264-4d... ... 20
2 0 0 biotin 478d1905-f264-4d... ... 20
vcNutritionId vcPortionId ftEnergyKcal vcPortionName
0 2476378b-79ee-48... 1 5.2 1 average pepper
1 2476378b-79ee-48... 1 5.2 1 average pepper
2 2476378b-79ee-48... 1 5.2 1 average pepper
[3 rows x 14 columns]
where l is your list (parsed JSON).
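For the doubly nested case, here is a self-contained sketch of the same idea that gathers the portions from every top-level record rather than only data[0]. The sample is a shortened, made-up version of the structure in the question (the "Activity" and "Wellbeing" values are invented), and pd.json_normalize is the top-level name available in pandas 1.0 and later.

```python
import json
import pandas as pd

# Shortened, hypothetical sample with the same shape as in the question
raw = """
[
  {
    "Activity": "walking",
    "Wellbeing": "good",
    "nutritionPortions": [
      {
        "vcPortionName": "1 average pepper",
        "vcPortionSize": "20",
        "ftEnergyKcal": 5.2,
        "nutritionNutrients": [
          {"vcNutrient": "alcohol", "ftValue": 0, "vcUnit": "g"},
          {"vcNutrient": "bcfa", "ftValue": 0, "vcUnit": "g"}
        ]
      }
    ]
  }
]
"""
data = json.loads(raw)

# Flatten the first level by hand: collect the portions of all records
portions = [p for record in data for p in record.get("nutritionPortions", [])]

# Flatten the second level: one row per nutrient, portion fields as meta
df = pd.json_normalize(
    portions,
    record_path="nutritionNutrients",
    meta=["vcPortionName", "vcPortionSize", "ftEnergyKcal"],
)
print(df)
```

Collecting the portions in plain Python first keeps the json_normalize call identical to the single-nested one, at the cost of dropping the top-level fields such as "Activity".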