Extracting data from two nested columns in one dataframe

Question

I have a pandas dataframe that contains transactions. A transaction is either booked as a payment, or a ledger_account_booking. A single transaction can have multiple payments and/or multiple ledger account bookings. Therefore, my columns payments and ledger_account_bookings contain a list of dicts, where the number of lists in a dict can vary. A small example dataframe looks as follows:

transaction_id	total_amount	date	payments	ledger_account_bookings
4308	645,83	30-8-2024	[]	[]
4254	291,67	2-7-2024	[]	[{'ledger_id': '4265', 'amount': '291,67'}]
4128	847	14-2-2024	[{'payment_id': '4128', 'amount': '847.0'}]	[]
4248	4286,98	25-6-2024	[{'payment_id': '4261', 'amount': '400.0'}, {'payment_id': '4262', 'amount': '11.0'}, {'payment_id': '4263', 'amount': '1668.51'}, {'payment_id': '4264', 'amount': '1868.54'}, {'payment_id': '4265', 'amount': '20.91'}, {'payment_id': '4266', 'amount': '2.21'}, {'payment_id': '4267', 'amount': '309.62'}]	[{'ledger_id' : '4265', 'amount': '6,19'}]
4192	6130,22	24-4-2024	[{'payment_id': '4193', 'amount': '9.68'}]	[{'ledger_id': '4222', 'amount':'2106.0'}, {'ledger_id': '4222','amount': '4014.54'}]
4090	1158,98	25-1-2024	[{'id': '4110','amount': '16.22'}, {'id': '4111', 'amount': '84.0'}, {'id': '4112', 'amount': '41.99'}, {'id': '4113, 'amount': '9.11',} {'id': '4114', 'amount': '10.0'}, {'id': '4115', 'amount': '997.16'}]	[{'ledger_id': '4231', 'amount': '-0.32'}, {'ledger_id': '4231', 'amount': '-0.18'}]

What I want is that every dict in one of the columns payments or ledger_account_bookings becomes a row in my dataframe. Expected result would look something like this:

transaction_id	total_amount	date	payment_id	payment_amount	ledger_id	ledger_amount
4308	645,83	30-8-2024	NaN	NaN	NaN	NaN
4254	291,67	2-7-2024	Nan	NaN	4265	291,67
4128	847	14-2-2024	4128	847.0	NaN	NaN
4248	4286,98	25-6-2024	4261	400.0	NaN	NaN
4248	4286,98	25-6-2024	4262	11.0	NaN	Nan
4248	4286,98	25-6-2024	4263	1668.51	NaN	Nan
4248	4286,98	25-6-2024	4264	1868.4	NaN	Nan
4248	4286,98	25-6-2024	4265	20.91	NaN	Nan
4248	4286,98	25-6-2024	4266	2.21	NaN	Nan
4248	4286,98	25-6-2024	4267	309.62	NaN	Nan
4248	4286,98	25-6-2024	NaN	NaN	4265	6,19
4192	6130,22	24-4-2024	4193	9.68	NaN	NaN
4192	6130,22	24-4-2024	NaN	NaN	4222	2106
4192	6130,22	24-4-2024	NaN	NaN	4222	4014.54
4090	1158,98	25-1-2024	4110	16.22	NaN	NaN
4090	1158,98	25-1-2024	4111	84.0	NaN	NaN
4090	1158,98	25-1-2024	4112	41.99	NaN	NaN
4090	1158,98	25-1-2024	4113	9.11	NaN	NaN
4090	1158,98	25-1-2024	4114	10.0	NaN	NaN
4090	1158,98	25-1-2024	4115	997.16	NaN	NaN
4090	1158,98	25-1-2024	NaN	NaN	4231	0.32
4090	1158,98	25-1-2024	NaN	NaN	4231	0.18

For example, transaction 4248 has 7 payments and 1 ledger account booking. So the resulting dataframe would have 8 rows. transaction 4192 has 2 payments and 1 ledger account bookings, so resulting df should have 3 rows.

I know how to achieve this for one column, for example by using the following code:

df_explode = df_financial_mutations.explode(['payments'])
#Normalize the json column into separate columns
df_normalized = json_normalize(df_explode['payments'])
#Add prefix to the columns that were 'exploded'
df_normalized = df_normalized.add_prefix('payments_')

The problem is, I don't know how to do it for two columns. If I would call explode on ledger_account_bookings again, the result becomes murky since I already have exploded the payments column, and therefore 'duplicate' rows were introduced into my dataframe. So, where a payment was exploded, I now have two rows with exactly the same values in the ledger_account_bookings column. When I explode again, this time on the other column, those 'duplicates' are also exploded, so that my dataframe now contains rows of data that don't make sense.

How do I solve such a problem where I need to explode two columns at once? I've seen Efficient way to unnest (explode) multiple list columns in a pandas DataFrame but unfortunately the lists of payments and ledger_account_bookings can be of different size, and can be dynamic as well (e.g. it's possible to have 0-5 payments and 0-5 ledger_account_bookings, there is no fixed value)

Any help would be greatly appreciated.

jezrael · Accepted Answer

Universal solution for processing data by tuples:

#in tuple set original and new columns names prefixes
cols = [('payments', 'payments'),('ledger_account_bookings', 'ledger')]
L = []
for col, prefix in cols:
    df_explode = df_financial_mutations.pop(col).explode()
    #Normalize the json column into separate columns
    df_normalized = pd.json_normalize(df_explode).set_index(df_explode.index)
    #Add prefix to the columns that were 'exploded'
    #Remove missing values if all NaNs per rows
    df_normalized = df_normalized.add_prefix(f'{prefix}_').dropna(how='all')

    L.append(df_normalized)

#join original columns to concanecated list of DataFrames
out = df_financial_mutations.join(pd.concat(L)).reset_index(drop=True)

#clean data - replacement missing values by another column
out['payments_id'] = out['payments_id'].fillna(out.pop('payments_payment_id'))
#renaming columns names
out = out.rename(columns={'ledger_ledger_id':'ledger_id'})

print (out)
    transaction_id total_amount        date payments_amount payments_id  \
0             4308      645,83   30-8-2024              NaN         NaN   
1             4254      291,67    2-7-2024              NaN         NaN   
2             4128         847   14-2-2024            847.0        4128   
3             4248     4286,98   25-6-2024            400.0        4261   
4             4248     4286,98   25-6-2024             11.0        4262   
5             4248     4286,98   25-6-2024          1668.51        4263   
6             4248     4286,98   25-6-2024          1868.54        4264   
7             4248     4286,98   25-6-2024            20.91        4265   
8             4248     4286,98   25-6-2024             2.21        4266   
9             4248     4286,98   25-6-2024           309.62        4267   
10            4248     4286,98   25-6-2024              NaN         NaN   
11            4192     6130,22   24-4-2024             9.68        4193   
12            4192     6130,22   24-4-2024              NaN         NaN   
13            4192     6130,22   24-4-2024              NaN         NaN   
14            4090     1158,98   25-1-2024            16.22        4110   
15            4090     1158,98   25-1-2024             84.0        4111   
16            4090     1158,98   25-1-2024            41.99        4112   
17            4090     1158,98   25-1-2024             9.11        4113   
18            4090     1158,98   25-1-2024             10.0        4114   
19            4090     1158,98   25-1-2024           997.16        4115   
20            4090     1158,98   25-1-2024              NaN         NaN   
21            4090     1158,98   25-1-2024              NaN         NaN   

   ledger_id ledger_amount  
0        NaN           NaN  
1       4265        291,67  
2        NaN           NaN  
3        NaN           NaN  
4        NaN           NaN  
5        NaN           NaN  
6        NaN           NaN  
7        NaN           NaN  
8        NaN           NaN  
9        NaN           NaN  
10      4265          6,19  
11       NaN           NaN  
12      4222        2106.0  
13      4222       4014.54  
14       NaN           NaN  
15       NaN           NaN  
16       NaN           NaN  
17       NaN           NaN  
18       NaN           NaN  
19       NaN           NaN  
20      4231         -0.32  
21      4231         -0.18

I suggest process each column separately and join to original data - solution processing each column separately:

#extract column payments by pop and expoding
df_explode = df_financial_mutations.pop('payments').explode()
#Normalize the json column into separate columns 
#Rewrite new index by original values from exploded DataFrame
df_normalized = pd.json_normalize(df_explode).set_index(df_explode.index)
#Add prefix to the columns that were 'exploded'
df_normalized = df_normalized.add_prefix('payments_')

#Rewrite missing values from  payments_id by payments_payment_id and remove column
df_normalized['payments_id'] = (df_normalized['payments_id']
                                 .fillna(df_normalized.pop('payments_payment_id')))

#Remove missing values if all NaNs per rows
df_normalized = df_normalized.dropna(how='all')

print (df_normalized)
  payments_amount payments_id
2           847.0        4128
3           400.0        4261
3            11.0        4262
3         1668.51        4263
3         1868.54        4264
3           20.91        4265
3            2.21        4266
3          309.62        4267
4            9.68        4193
5           16.22        4110
5            84.0        4111
5           41.99        4112
5            9.11        4113
5            10.0        4114
5          997.16        4115

df_explode1 = df_financial_mutations.pop('ledger_account_bookings').explode()
#Normalize the json column into separate columns
#Rewrite new index by original values from exploded DataFrame
df_normalized1 = pd.json_normalize(df_explode1).set_index(df_explode1.index)
#Add prefix to the columns that were 'exploded'
df_normalized1 = df_normalized1.add_prefix('ledger_')
#Remove missing values if all NaNs per rows
df_normalized1 = df_normalized1.dropna(how='all')

print (df_normalized1)
  ledger_ledger_id ledger_amount
1             4265        291,67
3             4265          6,19
4             4222        2106.0
4             4222       4014.54
5             4231         -0.32
5             4231         -0.18

out = df_financial_mutations.join(pd.concat([df_normalized, df_normalized1]))
print (out)

   transaction_id total_amount        date payments_amount payments_id  \
0            4308      645,83   30-8-2024              NaN         NaN   
1            4254      291,67    2-7-2024              NaN         NaN   
2            4128         847   14-2-2024            847.0        4128   
3            4248     4286,98   25-6-2024            400.0        4261   
3            4248     4286,98   25-6-2024             11.0        4262   
3            4248     4286,98   25-6-2024          1668.51        4263   
3            4248     4286,98   25-6-2024          1868.54        4264   
3            4248     4286,98   25-6-2024            20.91        4265   
3            4248     4286,98   25-6-2024             2.21        4266   
3            4248     4286,98   25-6-2024           309.62        4267   
3            4248     4286,98   25-6-2024              NaN         NaN   
4            4192     6130,22   24-4-2024             9.68        4193   
4            4192     6130,22   24-4-2024              NaN         NaN   
4            4192     6130,22   24-4-2024              NaN         NaN   
5            4090     1158,98   25-1-2024            16.22        4110   
5            4090     1158,98   25-1-2024             84.0        4111   
5            4090     1158,98   25-1-2024            41.99        4112   
5            4090     1158,98   25-1-2024             9.11        4113   
5            4090     1158,98   25-1-2024             10.0        4114   
5            4090     1158,98   25-1-2024           997.16        4115   
5            4090     1158,98   25-1-2024              NaN         NaN   
5            4090     1158,98   25-1-2024              NaN         NaN   

  ledger_ledger_id ledger_amount  
0              NaN           NaN  
1             4265        291,67  
2              NaN           NaN  
3              NaN           NaN  
3              NaN           NaN  
3              NaN           NaN  
3              NaN           NaN  
3              NaN           NaN  
3              NaN           NaN  
3              NaN           NaN  
3             4265          6,19  
4              NaN           NaN  
4             4222        2106.0  
4             4222       4014.54  
5              NaN           NaN  
5              NaN           NaN  
5              NaN           NaN  
5              NaN           NaN  
5              NaN           NaN  
5              NaN           NaN  
5             4231         -0.32  
5             4231         -0.18

#Create default index if necessary
out = out.reset_index(drop=True)

kho · Answer

For generalisation I propose the following solution and example. The solution works also for more than two rows.

import pandas as pd
import numpy as np

def extract_nested(df: pd.DataFrame,
                   column: str, 
                   prefix: (bool, None, str) = None, 
                   *args, **kwargs):
    df_i = df[column].explode(ignore_index=False)  # Merge the extracted data back with the non-nested columns
    df_i = pd.json_normalize(df_i, *args, **kwargs).set_index(df_i.index)  # Normalize the JSON-like structure into tabular format by containing the index
    
    if prefix != False:  # Add prefix to the resulting columns if set
        df_i = df_i.add_prefix(f'{column}_' if prefix is None else prefix)
    return df_i

def vstack_extract_nested(df: pd.DataFrame, 
                          columns: list, 
                          prefix: (bool, None, str, list) = None):
    df_ = df[df.columns.difference(columns)] # Select all columns except the ones containing nested data for the later merge
    prefix = prefix if isinstance(prefix, list) else [prefix]*len(columns)  # get prefix in the right shape
    df_list = [df_[~df[columns].astype(bool).any(axis=1)]]  # DataFrame rows that don't contain nested data
    
    for i, p_i in zip(columns, prefix):
        df_i = extract_nested(df, i, prefix=p_i).dropna(how='all')
        df_list.append(pd.merge(df_, df_i, left_index=True, right_index=True, how='right'))  # Merge the extracted data back with the non-nested columns

    return pd.concat(df_list).sort_index() # concat DataFrame rows that don't contain nested data with the those who does

Example

An example with 3 nested columns

df = pd.DataFrame({'id': 100+np.arange(5), 'col1': [[]]*5, 'col2': [[]]*5, 'col3': [[]]*5})
df.at[1, 'col1'] = [{'a': 1, 'b': 2}]
df.at[3, 'col1'] = [{'a': 3, 'b': 4}, {'a': 5, 'b': 6}]
df.at[2, 'col2'] = [{'A': 10, 'B': 20}]
df.at[3, 'col2'] = [{'A': 30, 'B': 40}, {'A': 50, 'B': 60}]
df.at[2, 'col3'] = [{'X': 100, 'Y': 200}]
df.at[3, 'col3'] = [{'X': 300, 'Y': 400}, {'X': 500, 'Y': 600}]

vstack_extract_nested(df, ['col1', 'col2', 'col3'], prefix=[None, '','c3_'])
#     id  col1_a  col1_b       A       B   c3_X   c3_Y
# 0  100     NaN     NaN     NaN     NaN    NaN    NaN
# 1  101     1.0     2.0     NaN     NaN    NaN    NaN
# 2  102     NaN     NaN    10.0    20.0    NaN    NaN
# 2  102     NaN     NaN     NaN     NaN  100.0  200.0
# 3  103     3.0     4.0     NaN     NaN    NaN    NaN
# 3  103     5.0     6.0     NaN     NaN    NaN    NaN
# 3  103     NaN     NaN    30.0    40.0    NaN    NaN
# 3  103     NaN     NaN    50.0    60.0    NaN    NaN
# 3  103     NaN     NaN     NaN     NaN  300.0  400.0
# 3  103     NaN     NaN     NaN     NaN  500.0  600.0
# 4  104     NaN     NaN     NaN     NaN    NaN    NaN

Example for OP's use case

df = pd.DataFrame({'transaction_id': [4308, 4254, 4128, 4090], 'payments': [[]]*4, 'ledger_account_bookings': [[]]*4})
df.at[1, 'ledger_account_bookings'] = [{'ledger_id': '4265', 'amount': '291,67'}]
df.at[2, 'payments'] = [{'payment_id': '4128', 'amount': '847.0'}]
df.at[3, 'ledger_account_bookings'] = [{'ledger_id': '4231', 'amount': '-0.32'},
                                       {'ledger_id': '4231', 'amount': '-0.18'}]
df.at[3, 'payments'] = [{'id': '4110','amount': '16.22'},
                        {'id': '4111', 'amount': '84.0'},
                        {'id': '4112', 'amount': '41.99'}]

# extract the nested columns
df_exp = vstack_extract_nested(df, ['ledger_account_bookings', 'payments'], prefix=['ledger_', 'payments_'])

# correct naming
df_exp.fillna(df_exp.pop('payments_payment_id'), inplace=True)  # correct naming of payments column (payment_id, payments_payment_id)
df_exp.rename(columns={'ledger_ledger_id':'ledger_id'}, inplace=True)  # correct naming of ledger column
df_exp
#    transaction_id ledger_id ledger_amount payments_amount payments_id
# 0            4308       NaN           NaN             NaN         NaN
# 1            4254      4265        291,67             NaN         NaN
# 2            4128       NaN           NaN           847.0         NaN
# 3            4090      4231         -0.32             NaN         NaN
# 3            4090      4231         -0.18             NaN         NaN
# 3            4090       NaN           NaN           16.22        4110
# 3            4090       NaN           NaN            84.0        4111
# 3            4090       NaN           NaN           41.99        4112

Extracting data from two nested columns in one dataframe

Tags:

python

pandas

dataframe

Psychotechnopath

2 Answers

jezrael

Example

Example for OP's use case

kho

Recent Activity

Donate For Us

Extracting data from two nested columns in one dataframe

Tags:

python

pandas

dataframe

Psychotechnopath

2 Answers

jezrael

Example

Example for OP's use case

kho

Related questions

Recent Activity

Donate For Us