 

Normalize JSON but keep id from top level

I have a JSON file from an API response that broadly follows this schema:

import pandas as pd 


j =[
  {
    "orders": [
      {
        "orderId": 0,
        "items": [
          { "item_1": "x", "item_price": 5.99 },
          { "item_1": "y", "item_price": 15.99 }
        ]
      }
    ]
  }
]

What I'm trying to do is normalize the JSON but retain the orderId from the top level, so I can re-join the datasets further down a data pipeline.

I thought this could be done with one of the keyword arguments in pd.json_normalize

df_orders = pd.json_normalize(j,record_path=['orders'])

print(df_orders)

   orderId                                              items
0        0  [{'item_1': 'x', 'item_price': 5.99}, {'item_1...

df_items = pd.json_normalize(j,record_path=['orders','items'])
print(df_items)

  item_1  item_price
0      x        5.99
1      y       15.99

What I would like is

  item_1  item_price  orders.orderId
0      x        5.99               0
1      y       15.99               0

--

I've gone through the pd.json_normalize documentation and attempted to use

pd.json_normalize(j,record_path=['orders','items'], meta=['orders'])

which just appends the entire orders object to each record.

  item_1  item_price                                             orders
0      x        5.99  {'orderId': 0, 'items': [{'item_1': 'x', 'item...
1      y       15.99  {'orderId': 0, 'items': [{'item_1': 'x', 'item...

Any tips on doing this with pd.json_normalize?

I have no issue with parsing the JSON manually and doing this as a two-step process, but I want to avoid further dependencies in my pipeline, to keep things dynamic if I can help it.

Umar.H asked Oct 17 '25 07:10


1 Answer

You can try wrapping the meta entry in an extra pair of square brackets and spelling out the path down to orderId.

pd.json_normalize(j,record_path=['orders','items'], meta=[['orders', 'orderId']])

Result:

  item_1  item_price orders.orderId
0      x        5.99              0
1      y       15.99              0

In pd.json_normalize, each entry of meta is either a field name, for a field at the top level of the input, or a list of keys that acts as a path to a nested field. Here ['orders', 'orderId'] is such a path. The outer list is still required even when it holds only this one path, because meta=['orders', 'orderId'] would be read as two separate fields rather than as one path. For reference, see the examples in the official documentation, e.g. the one that pulls the governor field through the path ['info', 'governor'].
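To make the path syntax concrete, here is a small sketch with two orders instead of one. The second order (orderId 1, item "z") is made up for illustration and is not part of the question's data; everything else follows the call shown above.

```python
import pandas as pd

# Sample data: the question's structure plus a second, hypothetical order.
data = [
    {
        "orders": [
            {
                "orderId": 0,
                "items": [
                    {"item_1": "x", "item_price": 5.99},
                    {"item_1": "y", "item_price": 15.99},
                ],
            },
            {
                "orderId": 1,
                "items": [{"item_1": "z", "item_price": 1.50}],
            },
        ]
    }
]

# record_path walks orders -> items to find the rows.
# meta holds one entry, and that entry is itself a path: orders -> orderId.
df_items = pd.json_normalize(
    data,
    record_path=["orders", "items"],
    meta=[["orders", "orderId"]],
)

print(df_items)
```

Each item row should carry the orderId of the order it came from, in a column named orders.orderId (the path joined with the default separator "."), which is what you need to join back to the orders frame later.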

The field orderId counts as a second-level field rather than a top-level one, probably because pd.json_normalize counts levels from the input data j that we pass in, and in j the top-level key is orders. Specifying record_path does not change this: meta paths stay relative to the input data, not to the records that record_path selects.
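One way to see that the levels are relative to the input is to pass the list of orders directly instead of j. This is only a sketch to illustrate the point: with the orders as input, orderId sits at the top level, so a plain string in meta should be enough. Note that j[0]["orders"] covers only the first element of j, so this is not a drop-in replacement when j has several elements.

```python
import pandas as pd

j = [
    {
        "orders": [
            {
                "orderId": 0,
                "items": [
                    {"item_1": "x", "item_price": 5.99},
                    {"item_1": "y", "item_price": 15.99},
                ],
            }
        ]
    }
]

# Input is j: orderId is one level below "orders", so meta needs a path.
nested = pd.json_normalize(
    j,
    record_path=["orders", "items"],
    meta=[["orders", "orderId"]],
)

# Input is the list of orders: orderId is now a top-level field.
orders = j[0]["orders"]
flat = pd.json_normalize(orders, record_path=["items"], meta=["orderId"])

print(list(nested.columns))
print(list(flat.columns))
```

The only difference between the two frames should be the name of the meta column: orders.orderId when the input is j, and orderId when the input is the list of orders.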

SeaBean answered Oct 19 '25 21:10


