I'm facing a difficult problem: lists in my DataFrame need to be matched before they can be exploded.
The problem is best described by the data itself, which looks like this:
import pandas as pd

df = pd.DataFrame({
    'A': [
        [0.05, 0.055, 0.055, 0.06, 0.065, 0.07, 0.075, 0.075, 0.085, 0.09, 1.32],
        [0.4, 0.06, 0.06, 0.13, 0.135, 0.145, 0.155, 0.17],
        [3.81, 0.3, 0.4, 0.425, 0.445, 0.48, 0.51, 0.54, 0.58, 0.62, 0.66, 0.66, 0.705, 0.53, 0.57, 0.61],
        [7.395, 0.075, 0.085, 0.09, 0.095, 0.1, 0.11, 0.12, 0.13, 0.14],
        [0.105, 0.11, 0.12, 0.125, 0.135, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.205, 2.21]
    ],
    'B': [
        [0.680, 1.320],
        [0.520, 0.130, 0.135, 0.145, 0.155, 0.170],
        [8.035, 3.810],
        [0.945, 7.395],
        [1.790, 2.210]
    ],
    'C': [
        ['08/01/91', '08/01/10'],
        ['09/01/92', '09/01/93', '09/01/94', '09/01/95', '09/01/96', '09/01/10'],
        ['11/01/91', '11/01/10'],
        ['09/01/93', '09/01/21'],
        ['12/01/92', '12/01/10']
    ]
})
df
A B C
0 [0.05, 0.055, 0.055, 0.06, 0.065, 0.07, 0.075, 0.075, 0.085, 0.09, 1.32] [0.68, 1.32] [08/01/91, 08/01/10]
1 [0.4, 0.06, 0.06, 0.13, 0.135, 0.145, 0.155, 0.17] [0.52, 0.13, 0.135, 0.145, 0.155, 0.17] [09/01/92, 09/01/93, 09/01/94, 09/01/95, 09/01/96, 09/01/10]
2 [3.81, 0.3, 0.4, 0.425, 0.445, 0.48, 0.51, 0.54, 0.58, 0.62, 0.66, 0.66, 0.705, 0.53, 0.57, 0.61] [8.035, 3.81] [11/01/91, 11/01/10]
3 [7.395, 0.075, 0.085, 0.09, 0.095, 0.1, 0.11, 0.12, 0.13, 0.14] [0.945, 7.395] [09/01/93, 09/01/21]
4 [0.105, 0.11, 0.12, 0.125, 0.135, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.205, 2.21] [1.79, 2.21] [12/01/92, 12/01/10]
It is guaranteed that the sum of the elements of the list in A equals the sum of the elements of the list in B. Usually they are aligned in order, but in some cases the order is reversed.
In cases like row 0, for example, the first 10 elements of A sum up to 0.68, and 1.32 is matched in order.
Row 2, however, is the other way around: 3.81 matches the last element of B. The B and C columns come from the same dataset, so both should be flipped over to match the order of A.
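To make the reversal rule concrete, here is a minimal sketch of the check I have in mind (the helper name and the float tolerance are illustrative assumptions, not part of my actual code): walk the running sum of A and see whether it first hits the first or the last element of B.

import math

def b_is_reversed(a, b):
    # Walk the running sum of `a`: if it first matches the last element
    # of `b` (within a float tolerance), then B and C are stored in
    # reverse order relative to A and should be flipped.
    total = 0.0
    for x in a:
        total += x
        if math.isclose(total, b[0], abs_tol=1e-9):
            return False
        if math.isclose(total, b[-1], abs_tol=1e-9):
            return True
    return False  # fall back to assuming the usual order

For row 0 this returns False; for row 2 it returns True, so B and C there would need to be reversed first.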
My desired output after matching & exploding looks like this:
A B C
0 0.05 0.68 08/01/91
0 0.055 0.68 08/01/91
0 0.055 0.68 08/01/91
0 0.06 0.68 08/01/91
0 0.065 0.68 08/01/91
0 0.07 0.68 08/01/91
0 0.075 0.68 08/01/91
0 0.075 0.68 08/01/91
0 0.085 0.68 08/01/91
0 0.09 0.68 08/01/91
0 1.32 1.32 08/01/10
...
2 3.81 3.81 11/01/10
2 0.3 8.035 11/01/91
2 0.4 8.035 11/01/91
2 0.425 8.035 11/01/91
2 0.445 8.035 11/01/91
2 0.48 8.035 11/01/91
2 0.51 8.035 11/01/91
2 0.54 8.035 11/01/91
2 0.58 8.035 11/01/91
2 0.62 8.035 11/01/91
2 0.66 8.035 11/01/91
2 0.66 8.035 11/01/91
2 0.705 8.035 11/01/91
2 0.53 8.035 11/01/91
2 0.57 8.035 11/01/91
2 0.61 8.035 11/01/91
Any ideas and approaches are deeply appreciated.
EDIT: I found I had made an error in the data above, and I have corrected it. B & C always have exactly the same number of elements in their lists. For completeness, the desired output for row 1 is:
1 0.4 0.520 09/01/92
1 0.06 0.520 09/01/92
1 0.06 0.520 09/01/92
1 0.13 0.130 09/01/93
1 0.135 0.135 09/01/94
1 0.145 0.145 09/01/95
1 0.155 0.155 09/01/96
1 0.17 0.170 09/01/10
The basic idea in the following example is to first create 'matched' lists of equal length in each column for each row of the data, and to then 'transpose' those lists ('explode' is not really the right term here). This also scales easily to more columns; let me know if you need any help generalizing the function.
def match_row(row):
    # Map each element of B to its corresponding date in C.
    bc_mapping = {b: c for b, c in zip(row['B'], row['C'])}
    # Elements of B that appear literally in A are already matched; the
    # single remaining element of B is the sum of the free elements of A.
    common_elements = set(row['A']).intersection(set(row['B']))
    sum_elements = set(row['B']).difference(common_elements)
    assert len(sum_elements) == 1  # Sanity check
    common_elements = sorted(common_elements)
    sum_element = list(sum_elements)[0]
    number_of_free_elements = len(row['A']) - len(common_elements)
    return pd.Series({
        'A': [element for element in row['A'] if element not in common_elements] + common_elements,
        'B': [sum_element] * number_of_free_elements + common_elements,
        'C': [bc_mapping[sum_element]] * number_of_free_elements
             + [bc_mapping[element] for element in common_elements]
    })
df = df. \
    apply(match_row, axis=1). \
    aggregate('sum'). \
    apply(pd.Series). \
    transpose()
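As a side note: if you are on pandas >= 1.3, an alternative to the sum/transpose trick is to explode the matched columns directly, which also preserves the original row index (the repeated 0/1/2 index in the desired output above). This is only a sketch of that alternative, not what the snippet above does:

matched = df.apply(match_row, axis=1)
# Multi-column explode requires the lists in each row to have equal
# lengths, which match_row guarantees by construction.
result = matched.explode(['A', 'B', 'C'])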
EDIT: Generalization to multiple columns. The case of multiple columns is not as trivial, but the following should work:
import random

# Synthetic extra column D with the same list lengths as B, just to
# demonstrate the generalization.
df['D'] = df['B'].apply(lambda b: [random.randint(0, 10) for _ in b])
def match_row(row, variable_column, reference_column):
    # Every column other than the variable and reference columns is
    # 'fixed': it rides along with the reference column.
    fixed_columns = row.index.tolist()
    fixed_columns.remove(variable_column)
    fixed_columns.remove(reference_column)
    variable_elements = row[variable_column]
    reference_elements = row[reference_column]
    # One entry per reference element, holding that element's values
    # across all fixed columns.
    fixed_elements = row[fixed_columns].apply(pd.Series).T.values.tolist()
    fixed_elements_mapping = {
        reference: fixed_elements
        for reference, fixed_elements
        in zip(reference_elements, fixed_elements)
    }
    common_elements = set(variable_elements).intersection(set(reference_elements))
    sum_elements = set(reference_elements).difference(common_elements)
    assert len(sum_elements) == 1  # Sanity check
    common_elements = sorted(common_elements)
    sum_element = list(sum_elements)[0]
    number_of_free_elements = len(variable_elements) - len(common_elements)
    variable_and_reference_result = pd.Series({
        variable_column: [element for element in variable_elements if element not in common_elements] + common_elements,
        reference_column: [sum_element] * number_of_free_elements + common_elements,
    })
    fixed_columns_result = pd.Series({
        column_name: [fixed_elements_mapping[sum_element][i]] * number_of_free_elements +
                     [fixed_elements_mapping[element][i] for element in common_elements]
        for i, column_name
        in enumerate(fixed_columns)
    })
    return pd.concat([variable_and_reference_result, fixed_columns_result])
df = df. \
    apply(lambda row: match_row(row, 'A', 'B'), axis=1). \
    aggregate('sum'). \
    apply(pd.Series). \
    transpose()
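One caveat worth flagging (my addition, not something the sample data triggers): both versions match elements of A against B by exact float equality via set intersection. That works for the literal values here, but if the numbers come out of arithmetic, you may want to round both columns before matching; the 6-digit precision below is just an assumption:

for column in ['A', 'B']:
    # Round list elements so that e.g. 0.68000000001 still pairs
    # with 0.68 during the set intersection.
    df[column] = df[column].apply(lambda lst: [round(x, 6) for x in lst])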