I have a pandas dataframe that contains transactions. A transaction is either booked as a payment, or a ledger_account_booking. A single transaction can have multiple payments and/or multiple ledger account bookings. Therefore, my columns payments and ledger_account_bookings contain a list of dicts, where the number of lists in a dict can vary. A small example dataframe looks as follows:
| transaction_id | total_amount | date | payments | ledger_account_bookings |
|---|---|---|---|---|
| 4308 | 645,83 | 30-8-2024 | [] | [] |
| 4254 | 291,67 | 2-7-2024 | [] | [{'ledger_id': '4265', 'amount': '291,67'}] |
| 4128 | 847 | 14-2-2024 | [{'payment_id': '4128', 'amount': '847.0'}] | [] |
| 4248 | 4286,98 | 25-6-2024 | [{'payment_id': '4261', 'amount': '400.0'}, {'payment_id': '4262', 'amount': '11.0'}, {'payment_id': '4263', 'amount': '1668.51'}, {'payment_id': '4264', 'amount': '1868.54'}, {'payment_id': '4265', 'amount': '20.91'}, {'payment_id': '4266', 'amount': '2.21'}, {'payment_id': '4267', 'amount': '309.62'}] |
[{'ledger_id' : '4265', 'amount': '6,19'}] |
| 4192 | 6130,22 | 24-4-2024 | [{'payment_id': '4193', 'amount': '9.68'}] | [{'ledger_id': '4222', 'amount':'2106.0'}, {'ledger_id': '4222','amount': '4014.54'}] |
| 4090 | 1158,98 | 25-1-2024 | [{'id': '4110','amount': '16.22'}, {'id': '4111', 'amount': '84.0'}, {'id': '4112', 'amount': '41.99'}, {'id': '4113, 'amount': '9.11',} {'id': '4114', 'amount': '10.0'}, {'id': '4115', 'amount': '997.16'}] |
[{'ledger_id': '4231', 'amount': '-0.32'}, {'ledger_id': '4231', 'amount': '-0.18'}] |
What I want is that every dict in one of the columns payments or ledger_account_bookings becomes a row in my dataframe. Expected result would look something like this:
| transaction_id | total_amount | date | payment_id | payment_amount | ledger_id | ledger_amount |
|---|---|---|---|---|---|---|
| 4308 | 645,83 | 30-8-2024 | NaN | NaN | NaN | NaN |
| 4254 | 291,67 | 2-7-2024 | Nan | NaN | 4265 | 291,67 |
| 4128 | 847 | 14-2-2024 | 4128 | 847.0 | NaN | NaN |
| 4248 | 4286,98 | 25-6-2024 | 4261 | 400.0 | NaN | NaN |
| 4248 | 4286,98 | 25-6-2024 | 4262 | 11.0 | NaN | Nan |
| 4248 | 4286,98 | 25-6-2024 | 4263 | 1668.51 | NaN | Nan |
| 4248 | 4286,98 | 25-6-2024 | 4264 | 1868.4 | NaN | Nan |
| 4248 | 4286,98 | 25-6-2024 | 4265 | 20.91 | NaN | Nan |
| 4248 | 4286,98 | 25-6-2024 | 4266 | 2.21 | NaN | Nan |
| 4248 | 4286,98 | 25-6-2024 | 4267 | 309.62 | NaN | Nan |
| 4248 | 4286,98 | 25-6-2024 | NaN | NaN | 4265 | 6,19 |
| 4192 | 6130,22 | 24-4-2024 | 4193 | 9.68 | NaN | NaN |
| 4192 | 6130,22 | 24-4-2024 | NaN | NaN | 4222 | 2106 |
| 4192 | 6130,22 | 24-4-2024 | NaN | NaN | 4222 | 4014.54 |
| 4090 | 1158,98 | 25-1-2024 | 4110 | 16.22 | NaN | NaN |
| 4090 | 1158,98 | 25-1-2024 | 4111 | 84.0 | NaN | NaN |
| 4090 | 1158,98 | 25-1-2024 | 4112 | 41.99 | NaN | NaN |
| 4090 | 1158,98 | 25-1-2024 | 4113 | 9.11 | NaN | NaN |
| 4090 | 1158,98 | 25-1-2024 | 4114 | 10.0 | NaN | NaN |
| 4090 | 1158,98 | 25-1-2024 | 4115 | 997.16 | NaN | NaN |
| 4090 | 1158,98 | 25-1-2024 | NaN | NaN | 4231 | 0.32 |
| 4090 | 1158,98 | 25-1-2024 | NaN | NaN | 4231 | 0.18 |
For example, transaction 4248 has 7 payments and 1 ledger account booking. So the resulting dataframe would have 8 rows. transaction 4192 has 2 payments and 1 ledger account bookings, so resulting df should have 3 rows.
I know how to achieve this for one column, for example by using the following code:
df_explode = df_financial_mutations.explode(['payments'])
#Normalize the json column into separate columns
df_normalized = json_normalize(df_explode['payments'])
#Add prefix to the columns that were 'exploded'
df_normalized = df_normalized.add_prefix('payments_')
The problem is, I don't know how to do it for two columns. If I would call explode on ledger_account_bookings again, the result becomes murky since I already have exploded the payments column, and therefore 'duplicate' rows were introduced into my dataframe. So, where a payment was exploded, I now have two rows with exactly the same values in the ledger_account_bookings column. When I explode again, this time on the other column, those 'duplicates' are also exploded, so that my dataframe now contains rows of data that don't make sense.
How do I solve such a problem where I need to explode two columns at once? I've seen Efficient way to unnest (explode) multiple list columns in a pandas DataFrame but unfortunately the lists of payments and ledger_account_bookings can be of different size, and can be dynamic as well (e.g. it's possible to have 0-5 payments and 0-5 ledger_account_bookings, there is no fixed value)
Any help would be greatly appreciated.
Universal solution for processing data by tuples:
#in tuple set original and new columns names prefixes
cols = [('payments', 'payments'),('ledger_account_bookings', 'ledger')]
L = []
for col, prefix in cols:
df_explode = df_financial_mutations.pop(col).explode()
#Normalize the json column into separate columns
df_normalized = pd.json_normalize(df_explode).set_index(df_explode.index)
#Add prefix to the columns that were 'exploded'
#Remove missing values if all NaNs per rows
df_normalized = df_normalized.add_prefix(f'{prefix}_').dropna(how='all')
L.append(df_normalized)
#join original columns to concanecated list of DataFrames
out = df_financial_mutations.join(pd.concat(L)).reset_index(drop=True)
#clean data - replacement missing values by another column
out['payments_id'] = out['payments_id'].fillna(out.pop('payments_payment_id'))
#renaming columns names
out = out.rename(columns={'ledger_ledger_id':'ledger_id'})
print (out)
transaction_id total_amount date payments_amount payments_id \
0 4308 645,83 30-8-2024 NaN NaN
1 4254 291,67 2-7-2024 NaN NaN
2 4128 847 14-2-2024 847.0 4128
3 4248 4286,98 25-6-2024 400.0 4261
4 4248 4286,98 25-6-2024 11.0 4262
5 4248 4286,98 25-6-2024 1668.51 4263
6 4248 4286,98 25-6-2024 1868.54 4264
7 4248 4286,98 25-6-2024 20.91 4265
8 4248 4286,98 25-6-2024 2.21 4266
9 4248 4286,98 25-6-2024 309.62 4267
10 4248 4286,98 25-6-2024 NaN NaN
11 4192 6130,22 24-4-2024 9.68 4193
12 4192 6130,22 24-4-2024 NaN NaN
13 4192 6130,22 24-4-2024 NaN NaN
14 4090 1158,98 25-1-2024 16.22 4110
15 4090 1158,98 25-1-2024 84.0 4111
16 4090 1158,98 25-1-2024 41.99 4112
17 4090 1158,98 25-1-2024 9.11 4113
18 4090 1158,98 25-1-2024 10.0 4114
19 4090 1158,98 25-1-2024 997.16 4115
20 4090 1158,98 25-1-2024 NaN NaN
21 4090 1158,98 25-1-2024 NaN NaN
ledger_id ledger_amount
0 NaN NaN
1 4265 291,67
2 NaN NaN
3 NaN NaN
4 NaN NaN
5 NaN NaN
6 NaN NaN
7 NaN NaN
8 NaN NaN
9 NaN NaN
10 4265 6,19
11 NaN NaN
12 4222 2106.0
13 4222 4014.54
14 NaN NaN
15 NaN NaN
16 NaN NaN
17 NaN NaN
18 NaN NaN
19 NaN NaN
20 4231 -0.32
21 4231 -0.18
I suggest process each column separately and join to original data - solution processing each column separately:
#extract column payments by pop and expoding
df_explode = df_financial_mutations.pop('payments').explode()
#Normalize the json column into separate columns
#Rewrite new index by original values from exploded DataFrame
df_normalized = pd.json_normalize(df_explode).set_index(df_explode.index)
#Add prefix to the columns that were 'exploded'
df_normalized = df_normalized.add_prefix('payments_')
#Rewrite missing values from payments_id by payments_payment_id and remove column
df_normalized['payments_id'] = (df_normalized['payments_id']
.fillna(df_normalized.pop('payments_payment_id')))
#Remove missing values if all NaNs per rows
df_normalized = df_normalized.dropna(how='all')
print (df_normalized)
payments_amount payments_id
2 847.0 4128
3 400.0 4261
3 11.0 4262
3 1668.51 4263
3 1868.54 4264
3 20.91 4265
3 2.21 4266
3 309.62 4267
4 9.68 4193
5 16.22 4110
5 84.0 4111
5 41.99 4112
5 9.11 4113
5 10.0 4114
5 997.16 4115
df_explode1 = df_financial_mutations.pop('ledger_account_bookings').explode()
#Normalize the json column into separate columns
#Rewrite new index by original values from exploded DataFrame
df_normalized1 = pd.json_normalize(df_explode1).set_index(df_explode1.index)
#Add prefix to the columns that were 'exploded'
df_normalized1 = df_normalized1.add_prefix('ledger_')
#Remove missing values if all NaNs per rows
df_normalized1 = df_normalized1.dropna(how='all')
print (df_normalized1)
ledger_ledger_id ledger_amount
1 4265 291,67
3 4265 6,19
4 4222 2106.0
4 4222 4014.54
5 4231 -0.32
5 4231 -0.18
out = df_financial_mutations.join(pd.concat([df_normalized, df_normalized1]))
print (out)
transaction_id total_amount date payments_amount payments_id \
0 4308 645,83 30-8-2024 NaN NaN
1 4254 291,67 2-7-2024 NaN NaN
2 4128 847 14-2-2024 847.0 4128
3 4248 4286,98 25-6-2024 400.0 4261
3 4248 4286,98 25-6-2024 11.0 4262
3 4248 4286,98 25-6-2024 1668.51 4263
3 4248 4286,98 25-6-2024 1868.54 4264
3 4248 4286,98 25-6-2024 20.91 4265
3 4248 4286,98 25-6-2024 2.21 4266
3 4248 4286,98 25-6-2024 309.62 4267
3 4248 4286,98 25-6-2024 NaN NaN
4 4192 6130,22 24-4-2024 9.68 4193
4 4192 6130,22 24-4-2024 NaN NaN
4 4192 6130,22 24-4-2024 NaN NaN
5 4090 1158,98 25-1-2024 16.22 4110
5 4090 1158,98 25-1-2024 84.0 4111
5 4090 1158,98 25-1-2024 41.99 4112
5 4090 1158,98 25-1-2024 9.11 4113
5 4090 1158,98 25-1-2024 10.0 4114
5 4090 1158,98 25-1-2024 997.16 4115
5 4090 1158,98 25-1-2024 NaN NaN
5 4090 1158,98 25-1-2024 NaN NaN
ledger_ledger_id ledger_amount
0 NaN NaN
1 4265 291,67
2 NaN NaN
3 NaN NaN
3 NaN NaN
3 NaN NaN
3 NaN NaN
3 NaN NaN
3 NaN NaN
3 NaN NaN
3 4265 6,19
4 NaN NaN
4 4222 2106.0
4 4222 4014.54
5 NaN NaN
5 NaN NaN
5 NaN NaN
5 NaN NaN
5 NaN NaN
5 NaN NaN
5 4231 -0.32
5 4231 -0.18
#Create default index if necessary
out = out.reset_index(drop=True)
For generalisation I propose the following solution and example. The solution works also for more than two rows.
import pandas as pd
import numpy as np
def extract_nested(df: pd.DataFrame,
column: str,
prefix: (bool, None, str) = None,
*args, **kwargs):
df_i = df[column].explode(ignore_index=False) # Merge the extracted data back with the non-nested columns
df_i = pd.json_normalize(df_i, *args, **kwargs).set_index(df_i.index) # Normalize the JSON-like structure into tabular format by containing the index
if prefix != False: # Add prefix to the resulting columns if set
df_i = df_i.add_prefix(f'{column}_' if prefix is None else prefix)
return df_i
def vstack_extract_nested(df: pd.DataFrame,
columns: list,
prefix: (bool, None, str, list) = None):
df_ = df[df.columns.difference(columns)] # Select all columns except the ones containing nested data for the later merge
prefix = prefix if isinstance(prefix, list) else [prefix]*len(columns) # get prefix in the right shape
df_list = [df_[~df[columns].astype(bool).any(axis=1)]] # DataFrame rows that don't contain nested data
for i, p_i in zip(columns, prefix):
df_i = extract_nested(df, i, prefix=p_i).dropna(how='all')
df_list.append(pd.merge(df_, df_i, left_index=True, right_index=True, how='right')) # Merge the extracted data back with the non-nested columns
return pd.concat(df_list).sort_index() # concat DataFrame rows that don't contain nested data with the those who does
An example with 3 nested columns
df = pd.DataFrame({'id': 100+np.arange(5), 'col1': [[]]*5, 'col2': [[]]*5, 'col3': [[]]*5})
df.at[1, 'col1'] = [{'a': 1, 'b': 2}]
df.at[3, 'col1'] = [{'a': 3, 'b': 4}, {'a': 5, 'b': 6}]
df.at[2, 'col2'] = [{'A': 10, 'B': 20}]
df.at[3, 'col2'] = [{'A': 30, 'B': 40}, {'A': 50, 'B': 60}]
df.at[2, 'col3'] = [{'X': 100, 'Y': 200}]
df.at[3, 'col3'] = [{'X': 300, 'Y': 400}, {'X': 500, 'Y': 600}]
vstack_extract_nested(df, ['col1', 'col2', 'col3'], prefix=[None, '','c3_'])
# id col1_a col1_b A B c3_X c3_Y
# 0 100 NaN NaN NaN NaN NaN NaN
# 1 101 1.0 2.0 NaN NaN NaN NaN
# 2 102 NaN NaN 10.0 20.0 NaN NaN
# 2 102 NaN NaN NaN NaN 100.0 200.0
# 3 103 3.0 4.0 NaN NaN NaN NaN
# 3 103 5.0 6.0 NaN NaN NaN NaN
# 3 103 NaN NaN 30.0 40.0 NaN NaN
# 3 103 NaN NaN 50.0 60.0 NaN NaN
# 3 103 NaN NaN NaN NaN 300.0 400.0
# 3 103 NaN NaN NaN NaN 500.0 600.0
# 4 104 NaN NaN NaN NaN NaN NaN
df = pd.DataFrame({'transaction_id': [4308, 4254, 4128, 4090], 'payments': [[]]*4, 'ledger_account_bookings': [[]]*4})
df.at[1, 'ledger_account_bookings'] = [{'ledger_id': '4265', 'amount': '291,67'}]
df.at[2, 'payments'] = [{'payment_id': '4128', 'amount': '847.0'}]
df.at[3, 'ledger_account_bookings'] = [{'ledger_id': '4231', 'amount': '-0.32'},
{'ledger_id': '4231', 'amount': '-0.18'}]
df.at[3, 'payments'] = [{'id': '4110','amount': '16.22'},
{'id': '4111', 'amount': '84.0'},
{'id': '4112', 'amount': '41.99'}]
# extract the nested columns
df_exp = vstack_extract_nested(df, ['ledger_account_bookings', 'payments'], prefix=['ledger_', 'payments_'])
# correct naming
df_exp.fillna(df_exp.pop('payments_payment_id'), inplace=True) # correct naming of payments column (payment_id, payments_payment_id)
df_exp.rename(columns={'ledger_ledger_id':'ledger_id'}, inplace=True) # correct naming of ledger column
df_exp
# transaction_id ledger_id ledger_amount payments_amount payments_id
# 0 4308 NaN NaN NaN NaN
# 1 4254 4265 291,67 NaN NaN
# 2 4128 NaN NaN 847.0 NaN
# 3 4090 4231 -0.32 NaN NaN
# 3 4090 4231 -0.18 NaN NaN
# 3 4090 NaN NaN 16.22 4110
# 3 4090 NaN NaN 84.0 4111
# 3 4090 NaN NaN 41.99 4112
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With