Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Extracting data from two nested columns in one dataframe

I have a pandas dataframe that contains transactions. A transaction is either booked as a payment, or a ledger_account_booking. A single transaction can have multiple payments and/or multiple ledger account bookings. Therefore, my columns payments and ledger_account_bookings contain a list of dicts, where the number of lists in a dict can vary. A small example dataframe looks as follows:

transaction_id total_amount date payments ledger_account_bookings
4308 645,83 30-8-2024 [] []
4254 291,67 2-7-2024 [] [{'ledger_id': '4265', 'amount': '291,67'}]
4128 847 14-2-2024 [{'payment_id': '4128', 'amount': '847.0'}] []
4248 4286,98 25-6-2024 [{'payment_id': '4261', 'amount': '400.0'},
 {'payment_id': '4262', 'amount': '11.0'},
 {'payment_id': '4263', 'amount': '1668.51'},
 {'payment_id': '4264', 'amount': '1868.54'},
 {'payment_id': '4265', 'amount': '20.91'},
 {'payment_id': '4266', 'amount': '2.21'},
 {'payment_id': '4267', 'amount': '309.62'}]
[{'ledger_id' : '4265', 'amount': '6,19'}]
4192 6130,22 24-4-2024 [{'payment_id': '4193', 'amount': '9.68'}] [{'ledger_id': '4222', 'amount':'2106.0'},
 {'ledger_id': '4222','amount': '4014.54'}]
4090 1158,98 25-1-2024 [{'id': '4110','amount': '16.22'},
 {'id': '4111', 'amount': '84.0'},
 {'id': '4112', 'amount': '41.99'},
 {'id': '4113, 'amount': '9.11',}
 {'id': '4114', 'amount': '10.0'},
 {'id': '4115', 'amount': '997.16'}]
[{'ledger_id': '4231', 'amount': '-0.32'},
 {'ledger_id': '4231', 'amount': '-0.18'}]

What I want is that every dict in one of the columns payments or ledger_account_bookings becomes a row in my dataframe. Expected result would look something like this:

transaction_id total_amount date payment_id payment_amount ledger_id ledger_amount
4308 645,83 30-8-2024 NaN NaN NaN NaN
4254 291,67 2-7-2024 Nan NaN 4265 291,67
4128 847 14-2-2024 4128 847.0 NaN NaN
4248 4286,98 25-6-2024 4261 400.0 NaN NaN
4248 4286,98 25-6-2024 4262 11.0 NaN Nan
4248 4286,98 25-6-2024 4263 1668.51 NaN Nan
4248 4286,98 25-6-2024 4264 1868.4 NaN Nan
4248 4286,98 25-6-2024 4265 20.91 NaN Nan
4248 4286,98 25-6-2024 4266 2.21 NaN Nan
4248 4286,98 25-6-2024 4267 309.62 NaN Nan
4248 4286,98 25-6-2024 NaN NaN 4265 6,19
4192 6130,22 24-4-2024 4193 9.68 NaN NaN
4192 6130,22 24-4-2024 NaN NaN 4222 2106
4192 6130,22 24-4-2024 NaN NaN 4222 4014.54
4090 1158,98 25-1-2024 4110 16.22 NaN NaN
4090 1158,98 25-1-2024 4111 84.0 NaN NaN
4090 1158,98 25-1-2024 4112 41.99 NaN NaN
4090 1158,98 25-1-2024 4113 9.11 NaN NaN
4090 1158,98 25-1-2024 4114 10.0 NaN NaN
4090 1158,98 25-1-2024 4115 997.16 NaN NaN
4090 1158,98 25-1-2024 NaN NaN 4231 0.32
4090 1158,98 25-1-2024 NaN NaN 4231 0.18

For example, transaction 4248 has 7 payments and 1 ledger account booking. So the resulting dataframe would have 8 rows. transaction 4192 has 2 payments and 1 ledger account bookings, so resulting df should have 3 rows.

I know how to achieve this for one column, for example by using the following code:

df_explode = df_financial_mutations.explode(['payments'])
#Normalize the json column into separate columns
df_normalized = json_normalize(df_explode['payments'])
#Add prefix to the columns that were 'exploded'
df_normalized = df_normalized.add_prefix('payments_')

The problem is, I don't know how to do it for two columns. If I would call explode on ledger_account_bookings again, the result becomes murky since I already have exploded the payments column, and therefore 'duplicate' rows were introduced into my dataframe. So, where a payment was exploded, I now have two rows with exactly the same values in the ledger_account_bookings column. When I explode again, this time on the other column, those 'duplicates' are also exploded, so that my dataframe now contains rows of data that don't make sense.

How do I solve such a problem where I need to explode two columns at once? I've seen Efficient way to unnest (explode) multiple list columns in a pandas DataFrame but unfortunately the lists of payments and ledger_account_bookings can be of different size, and can be dynamic as well (e.g. it's possible to have 0-5 payments and 0-5 ledger_account_bookings, there is no fixed value)

Any help would be greatly appreciated.

like image 653
Psychotechnopath Avatar asked Jul 01 '26 20:07

Psychotechnopath


2 Answers

Universal solution for processing data by tuples:

#in tuple set original and new columns names prefixes
cols = [('payments', 'payments'),('ledger_account_bookings', 'ledger')]
L = []
for col, prefix in cols:
    df_explode = df_financial_mutations.pop(col).explode()
    #Normalize the json column into separate columns
    df_normalized = pd.json_normalize(df_explode).set_index(df_explode.index)
    #Add prefix to the columns that were 'exploded'
    #Remove missing values if all NaNs per rows
    df_normalized = df_normalized.add_prefix(f'{prefix}_').dropna(how='all')

    L.append(df_normalized)

#join original columns to concanecated list of DataFrames
out = df_financial_mutations.join(pd.concat(L)).reset_index(drop=True)

#clean data - replacement missing values by another column
out['payments_id'] = out['payments_id'].fillna(out.pop('payments_payment_id'))
#renaming columns names
out = out.rename(columns={'ledger_ledger_id':'ledger_id'})

print (out)
    transaction_id total_amount        date payments_amount payments_id  \
0             4308      645,83   30-8-2024              NaN         NaN   
1             4254      291,67    2-7-2024              NaN         NaN   
2             4128         847   14-2-2024            847.0        4128   
3             4248     4286,98   25-6-2024            400.0        4261   
4             4248     4286,98   25-6-2024             11.0        4262   
5             4248     4286,98   25-6-2024          1668.51        4263   
6             4248     4286,98   25-6-2024          1868.54        4264   
7             4248     4286,98   25-6-2024            20.91        4265   
8             4248     4286,98   25-6-2024             2.21        4266   
9             4248     4286,98   25-6-2024           309.62        4267   
10            4248     4286,98   25-6-2024              NaN         NaN   
11            4192     6130,22   24-4-2024             9.68        4193   
12            4192     6130,22   24-4-2024              NaN         NaN   
13            4192     6130,22   24-4-2024              NaN         NaN   
14            4090     1158,98   25-1-2024            16.22        4110   
15            4090     1158,98   25-1-2024             84.0        4111   
16            4090     1158,98   25-1-2024            41.99        4112   
17            4090     1158,98   25-1-2024             9.11        4113   
18            4090     1158,98   25-1-2024             10.0        4114   
19            4090     1158,98   25-1-2024           997.16        4115   
20            4090     1158,98   25-1-2024              NaN         NaN   
21            4090     1158,98   25-1-2024              NaN         NaN   

   ledger_id ledger_amount  
0        NaN           NaN  
1       4265        291,67  
2        NaN           NaN  
3        NaN           NaN  
4        NaN           NaN  
5        NaN           NaN  
6        NaN           NaN  
7        NaN           NaN  
8        NaN           NaN  
9        NaN           NaN  
10      4265          6,19  
11       NaN           NaN  
12      4222        2106.0  
13      4222       4014.54  
14       NaN           NaN  
15       NaN           NaN  
16       NaN           NaN  
17       NaN           NaN  
18       NaN           NaN  
19       NaN           NaN  
20      4231         -0.32  
21      4231         -0.18  

I suggest process each column separately and join to original data - solution processing each column separately:

#extract column payments by pop and expoding
df_explode = df_financial_mutations.pop('payments').explode()
#Normalize the json column into separate columns 
#Rewrite new index by original values from exploded DataFrame
df_normalized = pd.json_normalize(df_explode).set_index(df_explode.index)
#Add prefix to the columns that were 'exploded'
df_normalized = df_normalized.add_prefix('payments_')

#Rewrite missing values from  payments_id by payments_payment_id and remove column
df_normalized['payments_id'] = (df_normalized['payments_id']
                                 .fillna(df_normalized.pop('payments_payment_id')))

#Remove missing values if all NaNs per rows
df_normalized = df_normalized.dropna(how='all')

print (df_normalized)
  payments_amount payments_id
2           847.0        4128
3           400.0        4261
3            11.0        4262
3         1668.51        4263
3         1868.54        4264
3           20.91        4265
3            2.21        4266
3          309.62        4267
4            9.68        4193
5           16.22        4110
5            84.0        4111
5           41.99        4112
5            9.11        4113
5            10.0        4114
5          997.16        4115

df_explode1 = df_financial_mutations.pop('ledger_account_bookings').explode()
#Normalize the json column into separate columns
#Rewrite new index by original values from exploded DataFrame
df_normalized1 = pd.json_normalize(df_explode1).set_index(df_explode1.index)
#Add prefix to the columns that were 'exploded'
df_normalized1 = df_normalized1.add_prefix('ledger_')
#Remove missing values if all NaNs per rows
df_normalized1 = df_normalized1.dropna(how='all')

print (df_normalized1)
  ledger_ledger_id ledger_amount
1             4265        291,67
3             4265          6,19
4             4222        2106.0
4             4222       4014.54
5             4231         -0.32
5             4231         -0.18

out = df_financial_mutations.join(pd.concat([df_normalized, df_normalized1]))
print (out)

   transaction_id total_amount        date payments_amount payments_id  \
0            4308      645,83   30-8-2024              NaN         NaN   
1            4254      291,67    2-7-2024              NaN         NaN   
2            4128         847   14-2-2024            847.0        4128   
3            4248     4286,98   25-6-2024            400.0        4261   
3            4248     4286,98   25-6-2024             11.0        4262   
3            4248     4286,98   25-6-2024          1668.51        4263   
3            4248     4286,98   25-6-2024          1868.54        4264   
3            4248     4286,98   25-6-2024            20.91        4265   
3            4248     4286,98   25-6-2024             2.21        4266   
3            4248     4286,98   25-6-2024           309.62        4267   
3            4248     4286,98   25-6-2024              NaN         NaN   
4            4192     6130,22   24-4-2024             9.68        4193   
4            4192     6130,22   24-4-2024              NaN         NaN   
4            4192     6130,22   24-4-2024              NaN         NaN   
5            4090     1158,98   25-1-2024            16.22        4110   
5            4090     1158,98   25-1-2024             84.0        4111   
5            4090     1158,98   25-1-2024            41.99        4112   
5            4090     1158,98   25-1-2024             9.11        4113   
5            4090     1158,98   25-1-2024             10.0        4114   
5            4090     1158,98   25-1-2024           997.16        4115   
5            4090     1158,98   25-1-2024              NaN         NaN   
5            4090     1158,98   25-1-2024              NaN         NaN   

  ledger_ledger_id ledger_amount  
0              NaN           NaN  
1             4265        291,67  
2              NaN           NaN  
3              NaN           NaN  
3              NaN           NaN  
3              NaN           NaN  
3              NaN           NaN  
3              NaN           NaN  
3              NaN           NaN  
3              NaN           NaN  
3             4265          6,19  
4              NaN           NaN  
4             4222        2106.0  
4             4222       4014.54  
5              NaN           NaN  
5              NaN           NaN  
5              NaN           NaN  
5              NaN           NaN  
5              NaN           NaN  
5              NaN           NaN  
5             4231         -0.32  
5             4231         -0.18  

#Create default index if necessary
out = out.reset_index(drop=True)
like image 117
jezrael Avatar answered Jul 04 '26 10:07

jezrael


For generalisation I propose the following solution and example. The solution works also for more than two rows.

import pandas as pd
import numpy as np

def extract_nested(df: pd.DataFrame,
                   column: str, 
                   prefix: (bool, None, str) = None, 
                   *args, **kwargs):
    df_i = df[column].explode(ignore_index=False)  # Merge the extracted data back with the non-nested columns
    df_i = pd.json_normalize(df_i, *args, **kwargs).set_index(df_i.index)  # Normalize the JSON-like structure into tabular format by containing the index
    
    if prefix != False:  # Add prefix to the resulting columns if set
        df_i = df_i.add_prefix(f'{column}_' if prefix is None else prefix)
    return df_i

def vstack_extract_nested(df: pd.DataFrame, 
                          columns: list, 
                          prefix: (bool, None, str, list) = None):
    df_ = df[df.columns.difference(columns)] # Select all columns except the ones containing nested data for the later merge
    prefix = prefix if isinstance(prefix, list) else [prefix]*len(columns)  # get prefix in the right shape
    df_list = [df_[~df[columns].astype(bool).any(axis=1)]]  # DataFrame rows that don't contain nested data
    
    for i, p_i in zip(columns, prefix):
        df_i = extract_nested(df, i, prefix=p_i).dropna(how='all')
        df_list.append(pd.merge(df_, df_i, left_index=True, right_index=True, how='right'))  # Merge the extracted data back with the non-nested columns

    return pd.concat(df_list).sort_index() # concat DataFrame rows that don't contain nested data with the those who does

Example

An example with 3 nested columns

df = pd.DataFrame({'id': 100+np.arange(5), 'col1': [[]]*5, 'col2': [[]]*5, 'col3': [[]]*5})
df.at[1, 'col1'] = [{'a': 1, 'b': 2}]
df.at[3, 'col1'] = [{'a': 3, 'b': 4}, {'a': 5, 'b': 6}]
df.at[2, 'col2'] = [{'A': 10, 'B': 20}]
df.at[3, 'col2'] = [{'A': 30, 'B': 40}, {'A': 50, 'B': 60}]
df.at[2, 'col3'] = [{'X': 100, 'Y': 200}]
df.at[3, 'col3'] = [{'X': 300, 'Y': 400}, {'X': 500, 'Y': 600}]

vstack_extract_nested(df, ['col1', 'col2', 'col3'], prefix=[None, '','c3_'])
#     id  col1_a  col1_b       A       B   c3_X   c3_Y
# 0  100     NaN     NaN     NaN     NaN    NaN    NaN
# 1  101     1.0     2.0     NaN     NaN    NaN    NaN
# 2  102     NaN     NaN    10.0    20.0    NaN    NaN
# 2  102     NaN     NaN     NaN     NaN  100.0  200.0
# 3  103     3.0     4.0     NaN     NaN    NaN    NaN
# 3  103     5.0     6.0     NaN     NaN    NaN    NaN
# 3  103     NaN     NaN    30.0    40.0    NaN    NaN
# 3  103     NaN     NaN    50.0    60.0    NaN    NaN
# 3  103     NaN     NaN     NaN     NaN  300.0  400.0
# 3  103     NaN     NaN     NaN     NaN  500.0  600.0
# 4  104     NaN     NaN     NaN     NaN    NaN    NaN

Example for OP's use case

df = pd.DataFrame({'transaction_id': [4308, 4254, 4128, 4090], 'payments': [[]]*4, 'ledger_account_bookings': [[]]*4})
df.at[1, 'ledger_account_bookings'] = [{'ledger_id': '4265', 'amount': '291,67'}]
df.at[2, 'payments'] = [{'payment_id': '4128', 'amount': '847.0'}]
df.at[3, 'ledger_account_bookings'] = [{'ledger_id': '4231', 'amount': '-0.32'},
                                       {'ledger_id': '4231', 'amount': '-0.18'}]
df.at[3, 'payments'] = [{'id': '4110','amount': '16.22'},
                        {'id': '4111', 'amount': '84.0'},
                        {'id': '4112', 'amount': '41.99'}]

# extract the nested columns
df_exp = vstack_extract_nested(df, ['ledger_account_bookings', 'payments'], prefix=['ledger_', 'payments_'])

# correct naming
df_exp.fillna(df_exp.pop('payments_payment_id'), inplace=True)  # correct naming of payments column (payment_id, payments_payment_id)
df_exp.rename(columns={'ledger_ledger_id':'ledger_id'}, inplace=True)  # correct naming of ledger column
df_exp
#    transaction_id ledger_id ledger_amount payments_amount payments_id
# 0            4308       NaN           NaN             NaN         NaN
# 1            4254      4265        291,67             NaN         NaN
# 2            4128       NaN           NaN           847.0         NaN
# 3            4090      4231         -0.32             NaN         NaN
# 3            4090      4231         -0.18             NaN         NaN
# 3            4090       NaN           NaN           16.22        4110
# 3            4090       NaN           NaN            84.0        4111
# 3            4090       NaN           NaN           41.99        4112
like image 21
kho Avatar answered Jul 04 '26 11:07

kho