I have a deeply nested JSON that I am trying to turn into a Pandas DataFrame using json_normalize.
A generic sample of the JSON data I'm working with looks like this (I've added context about what I'm trying to do at the bottom of the post):
{
"per_page": 2,
"total": 1,
"data": [{
"total_time": 0,
"collection_mode": "default",
"href": "https://api.surveymonkey.com/v3/responses/5007154325",
"custom_variables": {
"custvar_1": "one",
"custvar_2": "two"
},
"custom_value": "custom identifier for the response",
"edit_url": "https://www.surveymonkey.com/r/",
"analyze_url": "https://www.surveymonkey.com/analyze/browse/",
"ip_address": "",
"pages": [
{
"id": "103332310",
"questions": [{
"answers": [{
"choice_id": "3057839051"
}
],
"id": "319352786"
}
]
},
{
"id": "44783164",
"questions": [{
"id": "153745381",
"answers": [{
"text": "some_name"
}
]
}
]
},
{
"id": "44783183",
"questions": [{
"id": "153745436",
"answers": [{
"col_id": "1087201352",
"choice_id": "1087201369",
"row_id": "1087201362"
}, {
"col_id": "1087201353",
"choice_id": "1087201373",
"row_id": "1087201362"
}
]
}
]
}
],
"date_modified": "1970-01-17T19:07:34+00:00",
"response_status": "completed",
"id": "5007154325",
"collector_id": "50253586",
"recipient_id": "0",
"date_created": "1970-01-17T19:07:34+00:00",
"survey_id": "105723396"
}
],
"page": 1,
"links": {
"self": "https://api.surveymonkey.com/v3/surveys/123456/responses/bulk?page=1&per_page=2"
}
}
I'd like to end up with a dataframe that contains the question_id, page_id, response_id, and response data like this:
choice_id col_id row_id text question_id page_id response_id
0 3057839051 NaN NaN NaN 319352786 103332310 5007154325
1 NaN NaN NaN some_name 153745381 44783164 5007154325
2 1087201369 1087201352 1087201362 NaN 153745436 44783183 5007154325
3 1087201373 1087201353 1087201362 NaN 153745436 44783183 5007154325
I can get close by running the following code (Python 3.6):
from pandas.io.json import json_normalize  # pandas < 1.0; on 1.0+ use pandas.json_normalize

df = json_normalize(data=so_survey_responses['data'], record_path=['pages', 'questions'], meta='id', record_prefix='question_')
print(df)
Which returns:
question_answers question_id id
0 [{'choice_id': '3057839051'}] 319352786 5007154325
1 [{'text': 'some_name'}] 153745381 5007154325
2 [{'col_id': '1087201352', 'choice_id': '108720... 153745436 5007154325
But if I try to run json_normalize at a deeper level of nesting and keep the 'question_id' data from the result above, I can only get the page_id values back, not the true question_id values:
answers_df = json_normalize(data=so_survey_responses['data'], record_path=['pages', 'questions', 'answers'], meta=['id', ['questions', 'id'], ['pages', 'id']])
print(answers_df)
Returns:
choice_id col_id row_id text id questions.id pages.id
0 3057839051 NaN NaN NaN 5007154325 103332310 103332310
1 NaN NaN NaN some_name 5007154325 44783164 44783164
2 1087201369 1087201352 1087201362 NaN 5007154325 44783183 44783183
3 1087201373 1087201353 1087201362 NaN 5007154325 44783183 44783183
A complicating factor may be that all of the above (question_id, page_id, response_id) appear simply as 'id' keys in the JSON data.
I'm sure this is possible, but I can't get there. Any examples of how to do this?
Additional context: I'm trying to create a dataframe of SurveyMonkey API response output.
My long-term goal is to re-create the "all responses" Excel sheet that their export service provides.
I plan to do this by getting the response dataframe set up (above), and then using .apply() to match responses with their survey structure API output.
I've found the SurveyMonkey API pretty lackluster at providing useful output, but I'm new to Pandas so it's probably on me.
Pandas has a built-in function, json_normalize(), that flattens simple to moderately nested semi-structured JSON into a flat table. Its main parameter is data, a dict or list of dicts, and nested keys end up as column names joined by ".", which can be customized via the sep argument.
There are other ways to flatten JSON as well: you can write a small recursive flattener yourself, or use the json-flatten library. The recursive approach is quite easy to understand, though a bit slower than the library.
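For reference, here is a minimal sketch of that recursive approach (flatten_dict is a hypothetical helper written for illustration, not part of Pandas or the json-flatten library):
def flatten_dict(obj, parent_key='', sep='.'):
    # Recursively flatten nested dicts/lists into a single flat dict.
    # Nested keys are joined with `sep`; list elements are keyed by index.
    items = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            new_key = f'{parent_key}{sep}{key}' if parent_key else key
            items.update(flatten_dict(value, new_key, sep))
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            new_key = f'{parent_key}{sep}{i}' if parent_key else str(i)
            items.update(flatten_dict(value, new_key, sep))
    else:
        items[parent_key] = obj
    return items

# Example against the sample JSON above:
# flat = flatten_dict(so_survey_responses['data'][0])
# flat['pages.0.questions.0.answers.0.choice_id']  ->  '3057839051'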
You need to modify the meta parameter of your last attempt, and, if you want the columns named exactly the way you want them, you can do that with rename():
answers_df = json_normalize(
    data=so_survey_responses['data'],
    record_path=['pages', 'questions', 'answers'],
    meta=['id', ['pages', 'questions', 'id'], ['pages', 'id']]
).rename(
    index=str,
    columns={'id': 'response_id',
             'pages.questions.id': 'question_id',
             'pages.id': 'page_id'}
)
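For completeness, a small usage sketch: reindex() can reorder the resulting answers_df into the column layout shown in the question (any column missing from the data, such as text, comes back as all-NaN):
# Reorder answers_df (built above) to the desired column layout.
cols = ['choice_id', 'col_id', 'row_id', 'text',
        'question_id', 'page_id', 'response_id']
answers_df = answers_df.reindex(columns=cols)
print(answers_df)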
There is no way to do this in a completely generic way using json_normalize(). You can use the record_path and meta arguments to indicate how you want the JSON to be processed.
However, you can use the flatten package to flatten your deeply nested JSON and then convert the result to a Pandas DataFrame; its page has example usage of flattening a deeply nested JSON and converting it to a DataFrame.
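As a rough sketch, assuming the flatten_json package (pip install flatten_json; the answer above may have a different flattening library in mind), which exposes a flatten() function:
import pandas as pd
from flatten_json import flatten  # assumption: pip install flatten_json

# Flatten each response record into one flat dict; nested keys and list
# indices are joined into column names such as
# 'pages_0_questions_0_answers_0_choice_id'.
flat_records = [flatten(record, separator='_')
                for record in so_survey_responses['data']]
wide_df = pd.DataFrame(flat_records)
print(wide_df.columns.tolist())
Note that this gives one very wide row per response rather than one row per answer, so you would still need to reshape it to match the layout shown in the question.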