Matching & conditional exploding in pandas

I'm facing a difficult problem in which list elements need to be matched across columns before exploding.

My problem is best described by the data, which looks like this:

df = pd.DataFrame({
    'A': [
        [0.05, 0.055, 0.055, 0.06, 0.065, 0.07, 0.075, 0.075, 0.085, 0.09, 1.32],
        [0.4, 0.06, 0.06, 0.13, 0.135, 0.145, 0.155, 0.17] , 
        [3.81, 0.3, 0.4, 0.425, 0.445, 0.48, 0.51, 0.54, 0.58, 0.62, 0.66, 0.66, 0.705, 0.53, 0.57, 0.61], 
        [7.395, 0.075, 0.085, 0.09, 0.095, 0.1, 0.11, 0.12, 0.13, 0.14],
        [0.105, 0.11, 0.12, 0.125, 0.135, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.205, 2.21]
    ], 
    'B' : [
        [0.680, 1.320],
        [0.520, 0.130, 0.135, 0.145, 0.155, 0.170],
        [8.035, 3.810],
        [0.945, 7.395],
        [1.790, 2.210]
    ],
    'C' : [
        ['08/01/91', '08/01/10'],
        ['09/01/92', '09/01/93', '09/01/94', '09/01/95', '09/01/96', '09/01/10'],
        ['11/01/91', '11/01/10'],
        ['09/01/93', '09/01/21'],
        ['12/01/92', '12/01/10']
    ]
})
df

A   B   C
0   [0.05, 0.055, 0.055, 0.06, 0.065, 0.07, 0.075, 0.075, 0.085, 0.09, 1.32]    [0.68, 1.32]    [08/01/91, 08/01/10]
1   [0.4, 0.06, 0.06, 0.13, 0.135, 0.145, 0.155, 0.17]  [0.52, 0.13, 0.135, 0.145, 0.155, 0.17] [09/01/92, 09/01/93, 09/01/94, 09/01/95, 09/01/96, 09/01/10]
2   [3.81, 0.3, 0.4, 0.425, 0.445, 0.48, 0.51, 0.54, 0.58, 0.62, 0.66, 0.66, 0.705, 0.53, 0.57, 0.61]   [8.035, 3.81]   [11/01/91, 11/01/10]
3   [7.395, 0.075, 0.085, 0.09, 0.095, 0.1, 0.11, 0.12, 0.13, 0.14] [0.945, 7.395]  [09/01/93, 09/01/21]
4   [0.105, 0.11, 0.12, 0.125, 0.135, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.205, 2.21]  [1.79, 2.21]    [12/01/92, 12/01/10]

It is guaranteed that the sum of the elements of the list in A equals the sum of the elements of the list in B. Usually they are in the same order, but in some cases the order is reversed.

For example, in cases like row 0, the first 10 elements of A sum up to 0.68, and 1.32 is matched in order.

However, row 2 is the other way around: 3.81 matches the last element of B. The B and C columns come from the same dataset, so they should be flipped over to match the order of A.
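One way to make the "flipped" rows detectable is to use the guarantee that consecutive elements of A sum to the elements of B. The sketch below is my own illustrative approach (the function name `align_b_c` and the tolerance are assumptions, not part of the question): walk through A with a running sum, and if the groups don't line up with B as given, retry with B and C reversed.

```python
import math

def align_b_c(a, b, c):
    """Return (b, c), reversed if needed so their order matches a.

    Heuristic (an assumption, not from the question): walk through a,
    keeping a running sum; each time the sum reaches the next value of b,
    start a new group.  If the walk doesn't consume b exactly, retry with
    b and c reversed.
    """
    def fits(a, b):
        i, total = 0, 0.0
        for x in a:
            if i == len(b):
                return False  # elements left over after all of b was matched
            total += x
            if math.isclose(total, b[i], abs_tol=1e-9):
                i, total = i + 1, 0.0
        return i == len(b)

    if fits(a, b):
        return b, c
    if fits(a, b[::-1]):
        return b[::-1], c[::-1]
    raise ValueError("no consistent grouping in either direction")

# Row 2 of the sample: 3.81 comes first in A, so B and C get flipped
b, c = align_b_c(
    [3.81, 0.3, 0.4, 0.425, 0.445, 0.48, 0.51, 0.54, 0.58, 0.62,
     0.66, 0.66, 0.705, 0.53, 0.57, 0.61],
    [8.035, 3.81],
    ['11/01/91', '11/01/10'],
)
```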

My desired output will be like below after matching & exploding:

      A         B        C
0   0.05      0.68    08/01/91 
0   0.055     0.68    08/01/91 
0   0.055     0.68    08/01/91 
0   0.06      0.68    08/01/91 
0   0.065     0.68    08/01/91 
0   0.07      0.68    08/01/91 
0   0.075     0.68    08/01/91 
0   0.075     0.68    08/01/91 
0   0.085     0.68    08/01/91 
0   0.09      0.68    08/01/91 
0   1.32      1.32    08/01/10 
...
2   3.81      3.81    11/01/10
2   0.3       8.035   11/01/91         
2   0.4       8.035   11/01/91         
2   0.425     8.035   11/01/91           
2   0.445     8.035   11/01/91           
2   0.48      8.035   11/01/91          
2   0.51      8.035   11/01/91         
2   0.54      8.035   11/01/91          
2   0.58      8.035   11/01/91   
2   0.62      8.035   11/01/91   
2   0.66      8.035   11/01/91   
2   0.66      8.035   11/01/91   
2   0.705     8.035   11/01/91    
2   0.53      8.035   11/01/91   
2   0.57      8.035   11/01/91   
2   0.61      8.035   11/01/91   

Any ideas and approaches are deeply appreciated.

I found an error in the above data and corrected it. B and C always have exactly the same number of elements in their lists.

For the row 1 case, my desired output would be:

1   0.4       0.520   09/01/92                   
1   0.06      0.520   09/01/92               
1   0.06      0.520   09/01/92               
1   0.13      0.130   09/01/93               
1   0.135     0.135   09/01/94                
1   0.145     0.145   09/01/95                
1   0.155     0.155   09/01/96                
1   0.17      0.17    09/01/10     
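Assuming A and B are already in the same order, the grouping itself can be sketched in a single pass (`explode_matched` and the tolerance are my own illustrative names, not an existing API): emit one output row per element of A, and advance to the next (B, C) pair whenever the running sum of A reaches the current B value. On row 1 this reproduces the desired output above.

```python
import math

def explode_matched(a, b, c):
    """Yield one (a_i, b_j, c_j) row per element of a;
    advance j when the running sum of a hits b[j]."""
    rows, j, total = [], 0, 0.0
    for x in a:
        total += x
        rows.append((x, b[j], c[j]))
        if math.isclose(total, b[j], abs_tol=1e-9):
            j, total = j + 1, 0.0
    return rows

rows = explode_matched(
    [0.4, 0.06, 0.06, 0.13, 0.135, 0.145, 0.155, 0.17],
    [0.52, 0.13, 0.135, 0.145, 0.155, 0.17],
    ['09/01/92', '09/01/93', '09/01/94', '09/01/95', '09/01/96', '09/01/10'],
)
```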

Matthew Son asked Nov 25 '25 04:11

1 Answer

The basic idea in the following example is to first create 'matched' lists of equal lengths in each column for each row of the data, and to then 'transpose' those lists ("explode" is not really the right word here). This also scales easily to more columns; let me know if you need any help generalizing the function.

def match_row(row):
    # B and C always have the same length, so each B value maps to one C value
    bc_mapping = {b: c for b, c in zip(row['B'], row['C'])}
    # Values of B that appear verbatim in A ("matched" elements)
    common_elements = set(row['A']).intersection(set(row['B']))
    # The one remaining B value is the sum of the unmatched elements of A
    sum_elements = set(row['B']).difference(common_elements)
    assert len(sum_elements) == 1  # Sanity check

    common_elements = sorted(common_elements)
    sum_element = list(sum_elements)[0]
    number_of_free_elements = len(row['A']) - len(common_elements)

    # Build equal-length lists: free elements first, matched elements last
    return pd.Series({
         'A': [element for element in row['A'] if element not in common_elements] + common_elements,
         'B': [sum_element] * number_of_free_elements + common_elements,
         'C': [bc_mapping[sum_element]] * number_of_free_elements + [bc_mapping[element] for element in common_elements]
    })


df = (
    df
    .apply(match_row, axis=1)
    .aggregate('sum')
    .apply(pd.Series)
    .transpose()
)
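As a quick sanity check, here is the same function run on a reduced two-row sample (my own toy values, chosen to keep the sum property of the question):

```python
import pandas as pd

df = pd.DataFrame({
    'A': [[0.05, 0.63, 1.32], [3.81, 0.3, 7.735]],
    'B': [[0.68, 1.32], [8.035, 3.81]],
    'C': [['08/01/91', '08/01/10'], ['11/01/91', '11/01/10']],
})

def match_row(row):  # the function from above, unchanged
    bc_mapping = {b: c for b, c in zip(row['B'], row['C'])}
    common_elements = set(row['A']).intersection(set(row['B']))
    sum_elements = set(row['B']).difference(common_elements)
    assert len(sum_elements) == 1  # Sanity check

    common_elements = sorted(common_elements)
    sum_element = list(sum_elements)[0]
    number_of_free_elements = len(row['A']) - len(common_elements)

    return pd.Series({
        'A': [e for e in row['A'] if e not in common_elements] + common_elements,
        'B': [sum_element] * number_of_free_elements + common_elements,
        'C': [bc_mapping[sum_element]] * number_of_free_elements
             + [bc_mapping[e] for e in common_elements],
    })

result = (
    df
    .apply(match_row, axis=1)
    .aggregate('sum')
    .apply(pd.Series)
    .transpose()
)
```

On pandas 1.3 or newer, `df.apply(match_row, axis=1).explode(['A', 'B', 'C'])` yields the same rows while keeping the original index labels, which is closer to the layout of the desired output in the question.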

EDIT: Generalization to multiple columns. The multi-column case is not as trivial, but the following should work:

import random

df['D'] = df['B'].apply(lambda b: [random.randint(0, 10) for _ in b])

def match_row(row, variable_column, reference_column):
    fixed_columns = row.index.tolist()
    fixed_columns.remove(variable_column)
    fixed_columns.remove(reference_column)

    variable_elements = row[variable_column]
    reference_elements = row[reference_column]
    fixed_elements = row[fixed_columns].apply(pd.Series).T.values.tolist()

    fixed_elements_mapping = {
        reference: fixed_elements
        for reference, fixed_elements
        in zip(reference_elements, fixed_elements)
    }
    common_elements = set(variable_elements).intersection(set(reference_elements))
    sum_elements = set(reference_elements).difference(common_elements)
    assert len(sum_elements) == 1  # Sanity check

    common_elements = sorted(common_elements)
    sum_element = list(sum_elements)[0]
    number_of_free_elements = len(variable_elements) - len(common_elements)

    variable_and_reference_result = pd.Series({
        variable_column: [element for element in variable_elements if element not in common_elements] + common_elements,
        reference_column: [sum_element] * number_of_free_elements + common_elements,
    })
    fixed_columns_result = pd.Series({
        column_name: [fixed_elements_mapping[sum_element][i]] * number_of_free_elements +
                     [fixed_elements_mapping[element][i] for element in common_elements]
        for i, column_name
        in enumerate(fixed_columns)
    })
    return pd.concat([variable_and_reference_result, fixed_columns_result])


df = (
    df
    .apply(lambda row: match_row(row, 'A', 'B'), axis=1)
    .aggregate('sum')
    .apply(pd.Series)
    .transpose()
)
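The generalized version can be checked the same way on a small two-row frame; the fixed 'D' values below are my own deterministic stand-ins for the random ones above, so the run is reproducible:

```python
import pandas as pd

df = pd.DataFrame({
    'A': [[0.05, 0.63, 1.32], [3.81, 0.3, 7.735]],
    'B': [[0.68, 1.32], [8.035, 3.81]],
    'C': [['08/01/91', '08/01/10'], ['11/01/91', '11/01/10']],
    'D': [[10, 20], [30, 40]],
})

def match_row(row, variable_column, reference_column):
    # Columns other than the variable/reference pair ride along unchanged
    fixed_columns = row.index.tolist()
    fixed_columns.remove(variable_column)
    fixed_columns.remove(reference_column)

    variable_elements = row[variable_column]
    reference_elements = row[reference_column]
    fixed_elements = row[fixed_columns].apply(pd.Series).T.values.tolist()

    fixed_elements_mapping = {
        reference: fixed
        for reference, fixed in zip(reference_elements, fixed_elements)
    }
    common_elements = set(variable_elements).intersection(set(reference_elements))
    sum_elements = set(reference_elements).difference(common_elements)
    assert len(sum_elements) == 1  # Sanity check

    common_elements = sorted(common_elements)
    sum_element = list(sum_elements)[0]
    number_of_free_elements = len(variable_elements) - len(common_elements)

    variable_and_reference_result = pd.Series({
        variable_column: [e for e in variable_elements if e not in common_elements] + common_elements,
        reference_column: [sum_element] * number_of_free_elements + common_elements,
    })
    fixed_columns_result = pd.Series({
        column_name: [fixed_elements_mapping[sum_element][i]] * number_of_free_elements
                     + [fixed_elements_mapping[e][i] for e in common_elements]
        for i, column_name in enumerate(fixed_columns)
    })
    return pd.concat([variable_and_reference_result, fixed_columns_result])

result = (
    df
    .apply(lambda row: match_row(row, 'A', 'B'), axis=1)
    .aggregate('sum')
    .apply(pd.Series)
    .transpose()
)
```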
BStadlbauer answered Nov 27 '25 23:11


