Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas - Select row value from specific column based on value from other columns

This is little convoluted, but I'll just show my data

I constructed following dataframe:

      Mid_XYZ  Mid_YYY  Mid_ZZZ Select1 Select2
867    1019.11   1027.64  1022.68   XYZ   YYY
873    1018.04   1027.58  1022.81   XYZ   ZZZ

I would want to select values from columns based on Select1 and Select2 strings by matching on part of a column name. In a first row, this would be

1019.11 and 1027.64 (column Mid_XYZ and Mid_YYY) - because Select1 has string XYZ and Select2 has string YYY.

where, in a second row

1018.04 and 1022.81 (column Mid_XYZ and Mid_ZZZ)

Later, I plan to store sum of those values in new column. DataFrame will look like this

      Mid_XYZ  Mid_YYY  Mid_ZZZ Select1 Select2 Sum
867    1019.11   1027.64  1022.68   XYZ   YYY   2046.75
873    1018.04   1027.58  1022.81   XYZ   ZZZ   2040.85

I can change column names to exact matching, but there should be some solution with regex? I know about df.filter(regex='XYZ'), but how can I do it row-wise?

like image 586
Alex16237 Avatar asked Dec 18 '22 12:12

Alex16237


2 Answers

Use the following vectorized solution:

import numpy as np

# clean rows
clean = df.columns.str.replace('^Mid_', '', regex=True)

# find matching column indices
s1 = np.argmax(clean.values == df['Select1'].values[:, None], axis=1)
s2 = np.argmax(clean.values == df['Select2'].values[:, None], axis=1)

# index and sum
df['Sum'] = df.values[np.arange(len(s1)), s1] + df.values[np.arange(len(s2)), s2]

print(df)

Output

     Mid_XYZ  Mid_YYY  Mid_ZZZ Select1 Select2      Sum
867  1019.11  1027.64  1022.68     XYZ     YYY  2046.75
873  1018.04  1027.58  1022.81     XYZ     ZZZ  2040.85
like image 179
Dani Mesejo Avatar answered Feb 06 '23 18:02

Dani Mesejo


import pandas as pd

If you have:

df=pd.DataFrame.from_dict({'Mid_XYZ':[1019.11,1018.04],
                           'Mid_YYY':[1027.64,1027.58],
                           'Mid_ZZZ':[1022.68,1022.81],
                           'Select1':['XYZ','XYZ'],
                           'Select2':['YYY','ZZZ']})

You can do:

df['Sum']=df.apply(lambda row:
                   row['Mid_'+row['Select1']]+\
                   row['Mid_'+row['Select2']],
                   axis=1)

df will be:

   Mid_XYZ  Mid_YYY  Mid_ZZZ Select1 Select2      Sum
0  1019.11  1027.64  1022.68     XYZ     YYY  2046.75
1  1018.04  1027.58  1022.81     XYZ     ZZZ  2040.85

If you don't like lambda functions, can achieve the same result by defining a function:

def sumfunc(row):
    return row['Mid_'+row['Select1']]+row['Mid_'+row['Select2']]

Then:

df['Sum']=df.apply(sumfunc,axis=1)
like image 29
zabop Avatar answered Feb 06 '23 17:02

zabop