My knowledge of packages such as pandas is fairly shallow, and I've been looking for a solution to flatten data into rows. Given a list of dicts like this, with a surrogate key called entry_id:
data = [
    {
        "id": 1,
        "entry_id": 123,
        "type": "ticker",
        "value": "IBM"
    },
    {
        "id": 2,
        "entry_id": 123,
        "type": "company_name",
        "value": "International Business Machines"
    },
    {
        "id": 3,
        "entry_id": 123,
        "type": "cusip",
        "value": "01234567"
    },
    {
        "id": 4,
        "entry_id": 321,
        "type": "ticker",
        "value": "AAPL"
    },
    {
        "id": 5,
        "entry_id": 321,
        "type": "permno",
        "value": "123456"
    },
    {
        "id": 6,
        "entry_id": 321,
        "type": "company_name",
        "value": "Apple, Inc."
    },
    {
        "id": 7,
        "entry_id": 321,
        "type": "formation_date",
        "value": "1976-04-01"
    }
]
I would like to flatten the data into rows grouped by the surrogate key entry_id, to look like this (empty strings or None values, it doesn't matter which):
[
    {"entry_id": 123, "ticker": "IBM", "permno": "", "company_name": "International Business Machines", "cusip": "01234567", "formation_date": ""},
    {"entry_id": 321, "ticker": "AAPL", "permno": "123456", "company_name": "Apple, Inc.", "cusip": "", "formation_date": "1976-04-01"}
]
I've tried using DataFrame's groupby and json_normalize, but haven't been able to get the right level of sorcery for the desired result. I could walk the data in pure Python, but I doubt that would be a fast solution. I'm not sure how to specify that type holds the column names, value holds the cell values, and entry_id is the aggregation key. I'm open to packages other than pandas as well.
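For reference, the pure-Python walk I had in mind is roughly this (just a sketch; collections.defaultdict does the grouping, and the grouped and rows names are placeholders of my own):

from collections import defaultdict

# Every distinct type becomes a column in the flattened rows.
types = sorted({d["type"] for d in data})

# Group each value under its entry_id, keyed by type.
grouped = defaultdict(dict)
for d in data:
    grouped[d["entry_id"]][d["type"]] = d["value"]

# One flat dict per entry_id; missing types fall back to "".
rows = [
    {"entry_id": entry_id, **{t: values.get(t, "") for t in types}}
    for entry_id, values in grouped.items()
]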
We can create a dataframe from the given list of records, then pivot the dataframe to reshape it, fill the NaN values with empty strings, and finally convert the pivoted frame to a list of dictionaries:

import pandas as pd

df = pd.DataFrame(data)
df.pivot(index='entry_id', columns='type', values='value').fillna('').reset_index().to_dict('records')
[{'entry_id': 123,
'company_name': 'International Business Machines',
'cusip': '01234567',
'formation_date': '',
'permno': '',
'ticker': 'IBM'},
{'entry_id': 321,
'company_name': 'Apple, Inc.',
'cusip': '',
'formation_date': '1976-04-01',
'permno': '123456',
'ticker': 'AAPL'}]
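One caveat: pivot raises a ValueError if the same (entry_id, type) pair appears more than once. If that can happen in the real data, pivot_table with an explicit aggregation is a near drop-in replacement (here aggfunc='first', i.e. my assumption that the first duplicate wins):

# pivot_table tolerates duplicate (entry_id, type) pairs by aggregating them.
df.pivot_table(index='entry_id', columns='type', values='value', aggfunc='first').fillna('').reset_index().to_dict('records')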