
How to fill a column based on several other columns?

I have two dataframes like this:

import pandas as pd
import numpy as np

df1 = pd.DataFrame(
    {
        'A': list('aaabdcde'),
        'B': list('smnipiuy'),
        'C': list('zzzqqwll')
    }
)

df2 = pd.DataFrame(
    {
        'mapcol': list('abpppozl')
    }
)

   A  B  C
0  a  s  z
1  a  m  z
2  a  n  z
3  b  i  q
4  d  p  q
5  c  i  w
6  d  u  l
7  e  y  l

  mapcol
0      a
1      b
2      p
3      p
4      p
5      o
6      z
7      l

Now I want to create an additional column in df1, filled with values from the columns A, B and C, depending on whether those values can be found in df2['mapcol']. If values from more than one column of a row can be found, A is preferred over B, and B over C, so my expected outcome looks like this:

   A  B  C final
0  a  s  z     a  # <- values can be found in A and C, but A is preferred
1  a  m  z     a  # <- values can be found in A and C, but A is preferred
2  a  n  z     a  # <- values can be found in A and C, but A is preferred
3  b  i  q     b  # <- value can be found in A
4  d  p  q     p  # <- value can be found in B
5  c  i  w   NaN  # <- none of the values can be mapped
6  d  u  l     l  # <- value can be found in C
7  e  y  l     l  # <- value can be found in C

A straightforward implementation could look like this (filling the column final iteratively using fillna in the preferred order):

preferred_order = ['A', 'B', 'C']

df1['final'] = np.nan

for col in preferred_order:
    df1['final'] = df1['final'].fillna(df1[col][df1[col].isin(df2['mapcol'])])

which gives the desired outcome.
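The loop works because fillna aligns the Series it is given on the index, so only rows that are still missing and have a matching value in the current column get filled. A minimal sketch of a single iteration for column B; the names candidates and final are only illustrative here:

```python
import pandas as pd

df1 = pd.DataFrame({'A': list('aaabdcde'),
                    'B': list('smnipiuy'),
                    'C': list('zzzqqwll')})
df2 = pd.DataFrame({'mapcol': list('abpppozl')})

# Values of column B that also occur in df2['mapcol'];
# rows without a match are dropped, the original index labels are kept
candidates = df1['B'][df1['B'].isin(df2['mapcol'])]
print(candidates)  # only index 4 with value 'p' remains

# An all-missing column stands in for df1['final'] before the loop starts
final = pd.Series([None] * len(df1), index=df1.index, dtype=object)

# fillna matches on index labels, so only row 4 is filled here
final = final.fillna(candidates)
print(final)
```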

Does anyone see a solution that avoids the loop?

Cleb asked Dec 22 '22 17:12


2 Answers

You can use where and isin on the full dataframe df1 to mask the values that are not in df2['mapcol'], then reorder the columns with preferred_order, bfill along the columns (axis=1), and keep the first column with iloc:

preferred_order = ['A', 'B', 'C']

df1['final'] = (df1.where(df1.isin(df2['mapcol'].to_numpy()))
                   [preferred_order]
                   .bfill(axis=1)
                   .iloc[:, 0]
               )
print (df1)
   A  B  C final
0  a  s  z     a
1  a  m  z     a
2  a  n  z     a
3  b  i  q     b
4  d  p  q     p
5  c  i  w   NaN
6  d  u  l     l
7  e  y  l     l
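To see what each part of the chain contributes, it can be split into intermediate steps. This is only a sketch of the same approach; the names masked, ordered and filled are introduced here for illustration:

```python
import pandas as pd

df1 = pd.DataFrame({'A': list('aaabdcde'),
                    'B': list('smnipiuy'),
                    'C': list('zzzqqwll')})
df2 = pd.DataFrame({'mapcol': list('abpppozl')})
preferred_order = ['A', 'B', 'C']

# Step 1: keep values that occur in df2['mapcol'], set everything else to NaN
masked = df1.where(df1.isin(df2['mapcol'].to_numpy()))

# Step 2: put the columns into priority order
ordered = masked[preferred_order]

# Step 3: back-fill along each row, so the first column ends up holding
# the first non-missing value of that row (or NaN if there is none)
filled = ordered.bfill(axis=1)

# Step 4: the first column is the result
final = filled.iloc[:, 0]
print(final)
```

Because the priority is encoded only in the column order of step 2, changing preferred_order is enough to change which column wins.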
Ben.T answered Jan 06 '23 04:01


Use:

order = ['A', 'B', 'C']  # order of columns

d = df1[order].isin(df2['mapcol'].tolist()).loc[lambda x: x.any(axis=1)].idxmax(axis=1)
df1.loc[d.index, 'final'] = df1.lookup(d.index, d)

Details:

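Use DataFrame.isin and keep only the rows with at least one match by boolean masking with DataFrame.any along axis=1, then use DataFrame.idxmax along axis=1 to get, for each remaining row, the name of the first column in order that contains a match (idxmax returns the first occurrence of the maximum, which is True here).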

print(d)
0    A
1    A
2    A
3    A
4    B
6    C
7    C
dtype: object

Use DataFrame.lookup (available in pandas versions before 2.0) to look up the values in df1 corresponding to the index and columns of d, and assign these values to the column final:

print(df1)
   A  B  C final
0  a  s  z     a
1  a  m  z     a
2  a  n  z     a
3  b  i  q     b
4  d  p  q     p
5  c  i  w   NaN
6  d  u  l     l
7  e  y  l     l
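DataFrame.lookup was deprecated in pandas 1.2 and removed in pandas 2.0, so the snippet above fails on current versions. One possible substitute, sketched here and not part of the original answer, is NumPy integer indexing; the names rows, cols and values are illustrative:

```python
import pandas as pd

df1 = pd.DataFrame({'A': list('aaabdcde'),
                    'B': list('smnipiuy'),
                    'C': list('zzzqqwll')})
df2 = pd.DataFrame({'mapcol': list('abpppozl')})
order = ['A', 'B', 'C']  # order of columns

# For each row with at least one match, the first matching column name
d = (df1[order]
     .isin(df2['mapcol'].tolist())
     .loc[lambda x: x.any(axis=1)]
     .idxmax(axis=1))

# Translate row labels and column names into integer positions
rows = df1.index.get_indexer(d.index)
cols = df1.columns.get_indexer(d.to_numpy())

# Pick one value per (row, column) pair, which is what lookup used to do
values = df1.to_numpy()[rows, cols]

# Rows without any match are not assigned and stay NaN
df1.loc[d.index, 'final'] = values
print(df1)
```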
Shubham Sharma answered Jan 06 '23 05:01