Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

pandas dataframe masks to write values into new column

Based on this solution, I created several masks on a Pandas dataframe to create a new column which should be filled from different columns (based on the conditions).

The (simplified) code looks as follows:

mask0 = (df['condition'] == 1)
mask1 = (df['condition'] == 0)

df.loc[mask0, 'newColumn'] = df['otherColumn1']
df.loc[mask1, 'newColumn'] = df['otherColumn2']

However upon execution of the third line I get the following error:

ValueError: cannot reindex from a duplicate axis

If I simply do this, it works:

df.loc[mask0, 'newColumn'] = 1

What am I doing wrong?

like image 233
beta Avatar asked Aug 17 '19 11:08

beta


3 Answers

You have to filter in both sides:

mask0 = (df['condition'] == 1)
mask1 = (df['condition'] == 0)

df.loc[mask0, 'newColumn'] = df.loc[mask0, 'otherColumn1']
df.loc[mask1, 'newColumn'] = df.loc[mask1, 'otherColumn2']

But here is better use numpy.select for avoid repeating code:

df['newColumn'] = np.select([mask0, mask1], 
                            [df['otherColumn1'], df['otherColumn2']], 
                            default=np.nan)
like image 59
jezrael Avatar answered Oct 13 '22 15:10

jezrael


You need to mask the "data provider" as well:

df.loc[mask0, 'newColumn'] = df[mask0, 'otherColumn1']
df.loc[mask1, 'newColumn'] = df[mask1, 'otherColumn2']

If the first condition is true if the latter is false and vice versa, we can use np.where(..):

df['newColumn'] = np.where(mask0, df['otherColumn0'], df['otherColumn2'])

or you can use np.select(..) in case the two can be both false where we fallback on the old values in case both conditons are False:

df['newColumn'] = np.select(
    [mask0, mask1],
    [df['otherColumn1'], df['otherColumn2']],
    default=df['newColumn']
)

Here of course we make the assumption that newColumn already exists in the dataframe (for example through some previous processing).

like image 41
Willem Van Onsem Avatar answered Oct 13 '22 17:10

Willem Van Onsem


Another solution with np.where:

df['newColumn'] = np.where(df['condition'].eq(1), df['otherColumn1'], df['condition'])
df['newColumn'] = np.where(df['condition'].eq(0), df['otherColumn2'], df['condition'])
like image 41
U12-Forward Avatar answered Oct 13 '22 16:10

U12-Forward