Python pandas: fast way to flatten JSON into rows by a surrogate key

My knowledge of packages such as pandas is fairly shallow, and I've been looking for a way to flatten data into rows. Given a list of dicts like this, with a surrogate key called entry_id:

data = [
    {
        "id": 1,
        "entry_id": 123,
        "type": "ticker",
        "value": "IBM"
    },
    {
        "id": 2,
        "entry_id": 123,
        "type": "company_name",
        "value": "International Business Machines"
    },
    {
        "id": 3,
        "entry_id": 123,
        "type": "cusip",
        "value": "01234567"
    },
    {
        "id": 4,
        "entry_id": 321,
        "type": "ticker",
        "value": "AAPL"
    },
    {
        "id": 5,
        "entry_id": 321,
        "type": "permno",
        "value": "123456"
    },
    {
        "id": 6,
        "entry_id": 321,
        "type": "company_name",
        "value": "Apple, Inc."
    },
    {
        "id": 7,
        "entry_id": 321,
        "type": "formation_date",
        "value": "1976-04-01"
    }
]

I would like to flatten the data into rows grouped by the surrogate key entry_id to look like this (empty strings or None values, doesn't matter):

[
    {"entry_id": 123, "ticker": "IBM", "permno": "", "company_name": "International Business Machines", "cusip": "01234567", "formation_date": ""},
    {"entry_id": 321, "ticker": "AAPL", "permno": "123456", "company_name": "Apple, Inc.", "cusip": "", "formation_date": "1976-04-01"}
]

I've tried using DataFrame's groupby and json_normalize, but haven't been able to get the right level of sorcery for the desired result. I could walk the data in pure Python, but I'm certain that would not be a fast solution. I'm not sure how to specify that type is the column, value is the value, and entry_id is the aggregation key. I'm open to packages other than pandas as well.
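
For reference, the pure-Python walk mentioned above could look roughly like the sketch below. It is only a baseline to compare a pandas solution against, not a claim about which is faster; the helper name `flatten_by_key` and the trimmed `records` sample are made up for illustration.

```python
from collections import defaultdict

# Trimmed sample in the same shape as `data` above (illustrative only).
records = [
    {"id": 1, "entry_id": 123, "type": "ticker", "value": "IBM"},
    {"id": 4, "entry_id": 321, "type": "ticker", "value": "AAPL"},
    {"id": 5, "entry_id": 321, "type": "permno", "value": "123456"},
]


def flatten_by_key(records, key="entry_id"):
    """Group records by `key`, turning each distinct `type` into a column."""
    grouped = defaultdict(dict)  # key value -> {type: value}
    columns = {}                 # dict used as an ordered set of column names
    for rec in records:
        grouped[rec[key]][rec["type"]] = rec["value"]
        columns[rec["type"]] = None
    # Fill missing columns with empty strings so every row has the same keys.
    return [
        {key: k, **{col: row.get(col, "") for col in columns}}
        for k, row in grouped.items()
    ]


rows = flatten_by_key(records)
```

This makes a single pass over the input, and if an (entry_id, type) pair repeats, the last value seen silently wins.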

FlipperPA asked May 27 '21 13:05


1 Answer

We can create a DataFrame from the given list of records, pivot it so that entry_id becomes the index, each type becomes a column, and value fills the cells, then replace the resulting NaN values with empty strings and convert the pivoted frame to a list of dicts:

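import pandas as pd

df = pd.DataFrame(data)
# Keyword arguments and the full 'records' orient are required as of pandas 2.0
df.pivot(index='entry_id', columns='type', values='value').fillna('').reset_index().to_dict('records')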

[{'entry_id': 123,
  'company_name': 'International Business Machines',
  'cusip': '01234567',
  'formation_date': '',
  'permno': '',
  'ticker': 'IBM'},
 {'entry_id': 321,
  'company_name': 'Apple, Inc.',
  'cusip': '',
  'formation_date': '1976-04-01',
  'permno': '123456',
  'ticker': 'AAPL'}]
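
One caveat: `pivot` raises a `ValueError` if the same (entry_id, type) pair appears more than once. If that can happen in your data, one possible variation is `pivot_table` with an explicit aggregation. The sketch below uses a small made-up sample with a duplicated ticker (not from the question) and assumes that keeping the first value seen is acceptable.

```python
import pandas as pd

# Hypothetical sample with a repeated (entry_id, type) pair.
records = [
    {"id": 1, "entry_id": 123, "type": "ticker", "value": "IBM"},
    {"id": 2, "entry_id": 123, "type": "ticker", "value": "IBM.N"},
    {"id": 3, "entry_id": 321, "type": "ticker", "value": "AAPL"},
    {"id": 4, "entry_id": 321, "type": "permno", "value": "123456"},
]
df = pd.DataFrame(records)

# pivot_table aggregates duplicates instead of raising; "first" keeps the
# first value seen for each (entry_id, type) pair.
wide = df.pivot_table(
    index="entry_id", columns="type", values="value", aggfunc="first"
)

# Same tail as the pivot version: blank out gaps, restore entry_id as a
# column, and emit one dict per row.
rows = wide.fillna("").reset_index().to_dict("records")
```

Whether "first" is the right aggregation depends on what a duplicate means in your data; `aggfunc` also accepts other reducers such as "last".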

Shubham Sharma answered Sep 22 '22 12:09