 

Pandas - expand nested json array within column in dataframe

I have JSON data (coming from MongoDB) containing thousands of records (so an array/list of JSON objects), with a structure like the one below for each object:

{
   "id":1,
   "first_name":"Mead",
   "last_name":"Lantaph",
   "email":"[email protected]",
   "gender":"Male",
   "ip_address":"231.126.209.31",
   "nested_array_to_expand":[
      {
         "property":"Quaxo",
         "json_obj":{
            "prop1":"Chevrolet",
            "prop2":"Mercy Streets"
         }
      },
      {
         "property":"Blogpad",
         "json_obj":{
            "prop1":"Hyundai",
            "prop2":"Flashback"
         }
      },
      {
         "property":"Yabox",
         "json_obj":{
            "prop1":"Nissan",
            "prop2":"Welcome Mr. Marshall (Bienvenido Mister Marshall)"
         }
      }
   ]
}

When loaded into a dataframe, the "nested_array_to_expand" column is a string containing the JSON (I do use "json_normalize" during loading). The expected result is a dataframe with 3 rows (given the above example) and new columns for the nested objects, such as below:

index   email first_name gender  id      ip_address last_name  \
0  [email protected]       Mead   Male   1  231.126.209.31   Lantaph   
1  [email protected]       Mead   Male   1  231.126.209.31   Lantaph   
2  [email protected]       Mead   Male   1  231.126.209.31   Lantaph   

  nested_array_to_expand.property nested_array_to_expand.json_obj.prop1  \
0                           Quaxo                              Chevrolet   
1                         Blogpad                                Hyundai   
2                           Yabox                                 Nissan   

               nested_array_to_expand.json_obj.prop2  
0                                      Mercy Streets  
1                                          Flashback  
2  Welcome Mr. Marshall (Bienvenido Mister Marshall)  

I was able to get that result with the function below, but it is extremely slow (around 2s for 1k records), so I would like to either improve the existing code or find a completely different approach to get this result.

import pandas as pd
from pandas import json_normalize  # pandas.io.json.json_normalize in older versions


def expand_field(field, df, parent_id='id'):
    all_sub = pd.DataFrame()
    # we need an id per row to be able to merge the dataframes back together;
    # if there is no id column, create one from the row index
    if parent_id not in df:
        df[parent_id] = df.index

    # go through all rows and build a new dataframe with the expanded values
    for i, row in df.iterrows():
        try:
            sub = json_normalize(row[field])
            sub = sub.add_prefix(field + '.')
            sub['parent_id'] = row[parent_id]
            all_sub = all_sub.append(sub)
        except Exception as exc:
            print('could not expand row', i, exc)
    df = pd.merge(df, all_sub, left_on=parent_id, right_on='parent_id', how='left')
    # remove the helper column and the original nested column
    del df["parent_id"]
    del df[field]
    # return the expanded dataframe
    return df
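
For example, I call it like this (a sketch; it assumes the dataframe was loaded with json_normalize as shown in the edit below):

df = json_normalize(data)                        # data: list of dicts from MongoDB
df = expand_field('nested_array_to_expand', df)  # one row per nested element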

Many thanks for your help.

===== EDIT to answer a comment =====

The data loaded from MongoDB is an array of objects. I load it with the following code:

import json

data = json.loads(my_json_string)
df = json_normalize(data)

The output gives me a dataframe with df["nested_array_to_expand"] as dtype object (string):

0    [{'property': 'Quaxo', 'json_obj': {'prop1': '...
Name: nested_array_to_expand, dtype: object
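
Checking one cell, it actually seems to hold a Python list of dicts rather than a raw string, which is why json_normalize accepts it directly:

print(type(df.loc[0, "nested_array_to_expand"]))   # <class 'list'>
print(df.loc[0, "nested_array_to_expand"][0])      # {'property': 'Quaxo', 'json_obj': {...}}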
asked Dec 12 '17 by Eric D.


People also ask

How do I flatten nested JSON in a data frame?

Pandas has a nice built-in function called json_normalize() to flatten simple to moderately semi-structured nested JSON structures into flat tables. Parameters: data – dict or list of dicts. errors – {'raise', 'ignore'}, default 'raise'.
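
For illustration, a minimal sketch of that flattening (the 'info' records below are made up):

import pandas as pd

records = [{"id": 1, "info": {"a": 10, "b": 20}},
           {"id": 2, "info": {"a": 30, "b": 40}}]
flat = pd.json_normalize(records)   # columns: id, info.a, info.b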

How do you access nested elements in a JSON object?

Accessing nested JSON objects is just like accessing nested arrays. Nested objects are objects that sit inside another object. For example, 'vehicles' could be an object inside a main object called 'person'; using dot notation, the nested object's property (car) is accessed.
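
A minimal Python equivalent of that dot-notation access (the 'person'/'vehicles' values below are made up for illustration):

data = {"person": {"name": "Ada", "vehicles": {"car": "Fiat 500", "truck": "Hilux"}}}
car = data["person"]["vehicles"]["car"]   # nested values are reached by chaining key lookups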

How do I read a nested JSON file in Python?

Python has built-in functions that easily import JSON files as a Python dictionary or a Pandas dataframe. Use pd.read_json() to load simple JSONs and pd.json_normalize() to load nested JSONs.
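
A short sketch of both calls (the file name is hypothetical):

import json
import pandas as pd

df_flat = pd.read_json("records.json")   # simple, flat JSON

with open("records.json") as fh:         # nested JSON: load, then normalize
    data = json.load(fh)
df_nested = pd.json_normalize(data)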

How do you convert nested JSON to DataFrame in PySpark?

Add the JSON string as a collection type and pass it as an input to spark.createDataset. This converts it to a DataFrame. The JSON reader infers the schema automatically from the JSON string.
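
The createDataset call above is the Scala API; a rough PySpark equivalent (a sketch, assuming a local SparkSession) reads the JSON string through an RDD:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
json_str = '{"id": 1, "nested_array_to_expand": [{"property": "Quaxo"}]}'
# spark.read.json accepts an RDD of JSON strings and infers the schema
sdf = spark.read.json(spark.sparkContext.parallelize([json_str]))
sdf.printSchema()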


1 Answer

I propose what I think is an interesting answer using pandas.json_normalize.
I use it to expand the nested json -- maybe there is a better way, but you should definitely consider using this feature. Then you just have to rename the columns as you want.

import io
import json

import pandas as pd
from pandas import json_normalize

# Load the json string from the question into a Python dict
json_dict = json.load(io.StringIO(json_str))

# Scalar fields are broadcast to one row per nested element, then concatenated
# with the normalized nested array; finally the raw nested column is dropped
df = pd.concat([pd.DataFrame(json_dict),
                json_normalize(json_dict['nested_array_to_expand'])],
               axis=1).drop(columns='nested_array_to_expand')
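
As a side note, for the whole list of records at once, json_normalize can also do the expansion directly with record_path and meta (a sketch; the meta column list just mirrors the example record):

import json
import pandas as pd

data = json.loads(my_json_string)   # my_json_string: the full array from MongoDB
expanded = pd.json_normalize(
    data,
    record_path='nested_array_to_expand',
    meta=['id', 'first_name', 'last_name', 'email', 'gender', 'ip_address'],
    record_prefix='nested_array_to_expand.',
)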


answered Sep 21 '22 by Romain