
Using pandas and json_normalize to flatten nested JSON API response

I have deeply nested JSON that I am trying to turn into a pandas DataFrame using json_normalize.

A generic sample of the JSON data I'm working with looks like this (I've added context on what I'm trying to do at the bottom of the post):

{
    "per_page": 2,
    "total": 1,
    "data": [{
            "total_time": 0,
            "collection_mode": "default",
            "href": "https://api.surveymonkey.com/v3/responses/5007154325",
            "custom_variables": {
                "custvar_1": "one",
                "custvar_2": "two"
            },
            "custom_value": "custom identifier for the response",
            "edit_url": "https://www.surveymonkey.com/r/",
            "analyze_url": "https://www.surveymonkey.com/analyze/browse/",
            "ip_address": "",
            "pages": [
                {
                    "id": "103332310",
                    "questions": [{
                            "answers": [{
                                    "choice_id": "3057839051"
                                }
                            ],
                            "id": "319352786"
                        }
                    ]
                },
                {
                    "id": "44783164",
                    "questions": [{
                            "id": "153745381",
                            "answers": [{
                                    "text": "some_name"
                                }
                            ]
                        }
                    ]
                },
                {
                    "id": "44783183",
                    "questions": [{
                            "id": "153745436",
                            "answers": [{
                                    "col_id": "1087201352",
                                    "choice_id": "1087201369",
                                    "row_id": "1087201362"
                                }, {
                                    "col_id": "1087201353",
                                    "choice_id": "1087201373",
                                    "row_id": "1087201362"
                                }
                            ]
                        }
                    ]
                }
            ],
            "date_modified": "1970-01-17T19:07:34+00:00",
            "response_status": "completed",
            "id": "5007154325",
            "collector_id": "50253586",
            "recipient_id": "0",
            "date_created": "1970-01-17T19:07:34+00:00",
            "survey_id": "105723396"
        }
    ],
    "page": 1,
    "links": {
        "self": "https://api.surveymonkey.com/v3/surveys/123456/responses/bulk?page=1&per_page=2"
    }
}

I'd like to end up with a dataframe that contains the question_id, page_id, response_id, and response data like this:

    choice_id      col_id      row_id       text   question_id       page_id      response_id
0  3057839051         NaN         NaN        NaN     319352786     103332310       5007154325
1         NaN         NaN         NaN  some_name     153745381      44783164       5007154325
2  1087201369  1087201352  1087201362        NaN     153745436      44783183       5007154325
3  1087201373  1087201353  1087201362        NaN     153745436      44783183       5007154325

I can get close by running the following code (Python 3.6):

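# pandas >= 1.0; older releases import it from pandas.io.json instead
from pandas import json_normalize
df = json_normalize(data=so_survey_responses['data'], record_path=['pages', 'questions'], meta='id', record_prefix='question_')
print(df)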

Which returns:

                                    question_answers question_id          id
0                      [{'choice_id': '3057839051'}]   319352786  5007154325
1                            [{'text': 'some_name'}]   153745381  5007154325
2  [{'col_id': '1087201352', 'choice_id': '108720...   153745436  5007154325

But if I run json_normalize one level deeper and try to keep the question_id from the above result, the questions.id column comes back holding the page_id values, not the true question_id values:

answers_df = json_normalize(data=so_survey_responses['data'], record_path=['pages', 'questions', 'answers'], meta=['id', ['questions', 'id'], ['pages', 'id']])
print(answers_df)

Returns:

    choice_id      col_id      row_id       text          id questions.id   pages.id
0  3057839051         NaN         NaN        NaN  5007154325    103332310  103332310
1         NaN         NaN         NaN  some_name  5007154325     44783164   44783164
2  1087201369  1087201352  1087201362        NaN  5007154325     44783183   44783183
3  1087201373  1087201353  1087201362        NaN  5007154325     44783183   44783183

A complicating factor may be that all of the above (question_id, page_id, response_id) are keyed as 'id' in the JSON data.

I'm sure this is possible, but I can't get there. Any examples of how to do this?

Additional context: I'm trying to create a dataframe of SurveyMonkey API response output.

My long-term goal is to re-create the "all responses" Excel sheet that their export service provides.

I plan to do this by getting the response dataframe set up (above), and then using .apply() to match responses with their survey structure API output.

I've found the SurveyMonkey API pretty lackluster at providing useful output, but I'm new to Pandas so it's probably on me.

user2752159 asked Nov 07 '18 22:11




2 Answers

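You need to change the meta parameter in your last call: each meta path is resolved from the top of a response record, so the question id needs the full path ['pages', 'questions', 'id'] rather than ['questions', 'id']. If you want the columns named exactly as in your desired output, you can do that with rename: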

answers_df = json_normalize(data=so_survey_responses['data'],
                        record_path=['pages', 'questions', 'answers'],
                        meta=['id', ['pages', 'questions', 'id'], ['pages', 'id']])\
.rename(index=str,
        columns={'id': 'response_id', 'pages.questions.id': 'question_id', 'pages.id': 'page_id'})
y.luis answered Sep 27 '22 17:09



There is no way to do this in a completely generic way using json_normalize(). You can use the record_path and meta arguments to indicate how you want the JSON to be processed.

However, you can use the flatten package to flatten your deeply nested JSON and then convert the result to a pandas DataFrame. The package's page has example usage showing how to do both steps.
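To give a rough idea of what generic flattening produces, here is a hand-rolled sketch in plain Python. The helper flatten_nested is hypothetical and written only for illustration; it is not the flatten package's API, and the "_" separator and index-in-key naming are assumptions of this sketch.

```python
def flatten_nested(obj, parent_key="", sep="_"):
    """Recursively flatten nested dicts and lists into a single-level dict."""
    items = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            new_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
            items.update(flatten_nested(value, new_key, sep))
    elif isinstance(obj, list):
        # List positions become part of the key, e.g. pages_0_id.
        for index, value in enumerate(obj):
            new_key = f"{parent_key}{sep}{index}" if parent_key else str(index)
            items.update(flatten_nested(value, new_key, sep))
    else:
        # Scalar leaf: store it under the accumulated key.
        items[parent_key] = obj
    return items


# A single-page slice of the sample response from the question.
response = {
    "id": "5007154325",
    "pages": [{
        "id": "103332310",
        "questions": [{"id": "319352786",
                       "answers": [{"choice_id": "3057839051"}]}],
    }],
}

flat = flatten_nested(response)
for key, value in flat.items():
    print(key, "=", value)
```

Note the trade-off: this style of flattening yields one wide row per response, with list positions baked into the column names, rather than one row per answer. A list of such flat dicts can be passed straight to pandas.DataFrame, but reshaping it into the per-answer layout asked for in the question takes extra work.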

Abhinav Sood answered Sep 27 '22 18:09
