I have a deeply nested JSON that I am trying to turn into a Pandas DataFrame using json_normalize.
A generic sample of the JSON data I'm working with looks like this (I've added context about what I'm trying to do at the bottom of the post):
{
"per_page": 2,
"total": 1,
"data": [{
"total_time": 0,
"collection_mode": "default",
"href": "https://api.surveymonkey.com/v3/responses/5007154325",
"custom_variables": {
"custvar_1": "one",
"custvar_2": "two"
},
"custom_value": "custom identifier for the response",
"edit_url": "https://www.surveymonkey.com/r/",
"analyze_url": "https://www.surveymonkey.com/analyze/browse/",
"ip_address": "",
"pages": [
{
"id": "103332310",
"questions": [{
"answers": [{
"choice_id": "3057839051"
}
],
"id": "319352786"
}
]
},
{
"id": "44783164",
"questions": [{
"id": "153745381",
"answers": [{
"text": "some_name"
}
]
}
]
},
{
"id": "44783183",
"questions": [{
"id": "153745436",
"answers": [{
"col_id": "1087201352",
"choice_id": "1087201369",
"row_id": "1087201362"
}, {
"col_id": "1087201353",
"choice_id": "1087201373",
"row_id": "1087201362"
}
]
}
]
}
],
"date_modified": "1970-01-17T19:07:34+00:00",
"response_status": "completed",
"id": "5007154325",
"collector_id": "50253586",
"recipient_id": "0",
"date_created": "1970-01-17T19:07:34+00:00",
"survey_id": "105723396"
}
],
"page": 1,
"links": {
"self": "https://api.surveymonkey.com/v3/surveys/123456/responses/bulk?page=1&per_page=2"
}
}
I'd like to end up with a dataframe that contains the question_id, page_id, response_id, and response data like this:
choice_id col_id row_id text question_id page_id response_id
0 3057839051 NaN NaN NaN 319352786 103332310 5007154325
1 NaN NaN NaN some_name 153745381 44783164 5007154325
2 1087201369 1087201352 1087201362 NaN 153745436 44783183 5007154325
3 1087201373 1087201353 1087201362 NaN 153745436 44783183 5007154325
I can get close by running the following code (Python 3.6):
from pandas.io.json import json_normalize  # pandas < 1.0; on 1.0+ use pandas.json_normalize

df = json_normalize(data=so_survey_responses['data'], record_path=['pages', 'questions'], meta='id', record_prefix='question_')
print(df)
Which returns:
question_answers question_id id
0 [{'choice_id': '3057839051'}] 319352786 5007154325
1 [{'text': 'some_name'}] 153745381 5007154325
2 [{'col_id': '1087201352', 'choice_id': '108720... 153745436 5007154325
But if I try to run json_normalize at a deeper level of nesting and keep the 'question_id' data from the result above, I can only get the page_id values back, not the true question_id values:
answers_df = json_normalize(data=so_survey_responses['data'], record_path=['pages', 'questions', 'answers'], meta=['id', ['questions', 'id'], ['pages', 'id']])
print(answers_df)
Returns:
choice_id col_id row_id text id questions.id pages.id
0 3057839051 NaN NaN NaN 5007154325 103332310 103332310
1 NaN NaN NaN some_name 5007154325 44783164 44783164
2 1087201369 1087201352 1087201362 NaN 5007154325 44783183 44783183
3 1087201373 1087201353 1087201362 NaN 5007154325 44783183 44783183
A complicating factor may be that all of the above (question_id, page_id, response_id) appear simply as 'id' keys in the JSON data.
I'm sure this is possible, but I can't get there. Any examples of how to do this?
Additional context: I'm trying to create a dataframe of SurveyMonkey API response output.
My long-term goal is to re-create the "all responses" Excel sheet that their export service provides.
I plan to do this by getting the response dataframe set up (above), and then using .apply() to match responses with their survey structure API output.
I've found the SurveyMonkey API pretty lackluster at providing useful output, but I'm new to Pandas so it's probably on me.
Pandas has a built-in function, json_normalize(), that flattens simple to moderately nested semi-structured JSON into a flat table. Its main parameter is data, a dict or list of dicts, and nested keys end up as column names joined by ".", which can be customized via the sep argument.
There are other ways to flatten JSON as well: you can write a small recursive flattener yourself, or use the json-flatten library. The recursive approach is quite easy to understand, though a bit slower than the library.
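For reference, here is a minimal sketch of that recursive approach (flatten_dict is a hypothetical helper written for illustration, not part of Pandas or the json-flatten library):
def flatten_dict(obj, parent_key='', sep='.'):
    # Recursively flatten nested dicts/lists into a single flat dict.
    # Nested keys are joined with `sep`; list elements are keyed by index.
    items = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            new_key = f'{parent_key}{sep}{key}' if parent_key else key
            items.update(flatten_dict(value, new_key, sep))
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            new_key = f'{parent_key}{sep}{i}' if parent_key else str(i)
            items.update(flatten_dict(value, new_key, sep))
    else:
        items[parent_key] = obj
    return items

# Example against the sample JSON above:
# flat = flatten_dict(so_survey_responses['data'][0])
# flat['pages.0.questions.0.answers.0.choice_id']  ->  '3057839051'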
You need to modify the meta parameter of your last attempt, and, if you want the columns named exactly the way you want them, you can do that with rename():
answers_df = json_normalize(
    data=so_survey_responses['data'],
    record_path=['pages', 'questions', 'answers'],
    meta=['id', ['pages', 'questions', 'id'], ['pages', 'id']]
).rename(
    index=str,
    columns={'id': 'response_id',
             'pages.questions.id': 'question_id',
             'pages.id': 'page_id'}
)
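For completeness, a small usage sketch: reindex() can reorder the resulting answers_df into the column layout shown in the question (any column missing from the data, such as text, comes back as all-NaN):
# Reorder answers_df (built above) to the desired column layout.
cols = ['choice_id', 'col_id', 'row_id', 'text',
        'question_id', 'page_id', 'response_id']
answers_df = answers_df.reindex(columns=cols)
print(answers_df)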
There is no way to do this in a completely generic way using json_normalize(). You can use the record_path and meta arguments to indicate how you want the JSON to be processed.
However, you can use the flatten package to flatten your deeply nested JSON and then convert the result to a Pandas DataFrame; its page has example usage of flattening a deeply nested JSON and converting it to a DataFrame.
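As a rough sketch, assuming the flatten_json package (pip install flatten_json; the answer above may have a different flattening library in mind), which exposes a flatten() function:
import pandas as pd
from flatten_json import flatten  # assumption: pip install flatten_json

# Flatten each response record into one flat dict; nested keys and list
# indices are joined into column names such as
# 'pages_0_questions_0_answers_0_choice_id'.
flat_records = [flatten(record, separator='_')
                for record in so_survey_responses['data']]
wide_df = pd.DataFrame(flat_records)
print(wide_df.columns.tolist())
Note that this gives one very wide row per response rather than one row per answer, so you would still need to reshape it to match the layout shown in the question.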