My knowledge of packages such as pandas is fairly shallow, and I've been looking for a solution to flatten data into rows. Given a list of dicts like this, with a surrogate key called entry_id:
data = [
    {
        "id": 1,
        "entry_id": 123,
        "type": "ticker",
        "value": "IBM"
    },
    {
        "id": 2,
        "entry_id": 123,
        "type": "company_name",
        "value": "International Business Machines"
    },
    {
        "id": 3,
        "entry_id": 123,
        "type": "cusip",
        "value": "01234567"
    },
    {
        "id": 4,
        "entry_id": 321,
        "type": "ticker",
        "value": "AAPL"
    },
    {
        "id": 5,
        "entry_id": 321,
        "type": "permno",
        "value": "123456"
    },
    {
        "id": 6,
        "entry_id": 321,
        "type": "company_name",
        "value": "Apple, Inc."
    },
    {
        "id": 7,
        "entry_id": 321,
        "type": "formation_date",
        "value": "1976-04-01"
    }
]
I would like to flatten the data into rows grouped by the surrogate key entry_id, to look like this (empty strings or None values, it doesn't matter which):
[
    {"entry_id": 123, "ticker": "IBM", "permno": "", "company_name": "International Business Machines", "cusip": "01234567", "formation_date": ""},
    {"entry_id": 321, "ticker": "AAPL", "permno": "123456", "company_name": "Apple, Inc.", "cusip": "", "formation_date": "1976-04-01"}
]
I've tried using DataFrame's groupby and json_normalize, but haven't been able to get the right level of sorcery for the desired result. I could walk the data in pure Python, but I doubt that would be a fast solution. I'm not sure how to specify that type holds the column names, value holds the cell values, and entry_id is the aggregation key. I'm open to packages other than pandas as well.
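For reference, the pure-Python walk I had in mind is roughly this (just a sketch; collections.defaultdict does the grouping, and the grouped and rows names are placeholders of my own):

from collections import defaultdict

# Every distinct type becomes a column in the flattened rows.
types = sorted({d["type"] for d in data})

# Group each value under its entry_id, keyed by type.
grouped = defaultdict(dict)
for d in data:
    grouped[d["entry_id"]][d["type"]] = d["value"]

# One flat dict per entry_id; missing types fall back to "".
rows = [
    {"entry_id": entry_id, **{t: values.get(t, "") for t in types}}
    for entry_id, values in grouped.items()
]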
We can create a dataframe from the given list of records, then pivot the dataframe to reshape it, fill the NaN values with empty strings, and finally convert the pivoted frame to a list of dictionaries:

import pandas as pd

df = pd.DataFrame(data)
df.pivot(index='entry_id', columns='type', values='value').fillna('').reset_index().to_dict('records')
[{'entry_id': 123,
'company_name': 'International Business Machines',
'cusip': '01234567',
'formation_date': '',
'permno': '',
'ticker': 'IBM'},
{'entry_id': 321,
'company_name': 'Apple, Inc.',
'cusip': '',
'formation_date': '1976-04-01',
'permno': '123456',
'ticker': 'AAPL'}]
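One caveat: pivot raises a ValueError if the same (entry_id, type) pair appears more than once. If that can happen in the real data, pivot_table with an explicit aggregation is a near drop-in replacement (here aggfunc='first', i.e. my assumption that the first duplicate wins):

# pivot_table tolerates duplicate (entry_id, type) pairs by aggregating them.
df.pivot_table(index='entry_id', columns='type', values='value', aggfunc='first').fillna('').reset_index().to_dict('records')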