Pandas - Replace other columns in row with 0 if a specific column has a value of 1

Tags: python, pandas

Here is an example dataframe:

X  Y  Z
1  0  1
0  1  0
1  1  1

Now, here is the rule I've come up with:

X is left as is
If Y is equal to 1, set the corresponding value in X to 0
If Z is equal to 1, set the corresponding values in X and Y to 0

The final dataframe should look like this:

X  Y  Z
0  0  1
0  1  0
0  0  1

My first thought at a solution is this:

df_null_list = ['X']
for i in ['Y', 'Z']:
    df[df[i] == 1][df_null_list] = 0
    df_null_list.append(i)

When I do this and sum across the y axis, I'm starting to get values of 2 and 4, which don't make sense. Note that I'm referring to when I ran this on the actual dataset.

Do you have any suggestions for improvements or alternative solutions?

349

asked Nov 04 '18 16:11

madsthaks

2 Answers

Use mask, which replaces the values where the condition is True:

df['X'] = df['X'].mask(df.Y == 1, 0)
df[['X', 'Y']] = df[['X', 'Y']].mask(df.Z == 1, 0)

Another solution with DataFrame.loc:

df.loc[df.Y == 1, 'X'] = 0
df.loc[df.Z == 1, ['X', 'Y']] = 0

print(df)
   X  Y  Z
0  0  0  1
1  0  1  0
2  0  0  1
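The loop from the question can be kept if the chained indexing is replaced with a single .loc call. With `df[df[i] == 1][df_null_list] = 0`, the assignment lands on a temporary copy, so `df` itself is not modified. A minimal sketch, assuming the columns are listed from lowest to highest priority; the helper name `zero_lower_priority` is made up for illustration and is not part of pandas:

```python
import pandas as pd

def zero_lower_priority(df, columns):
    """Return a copy in which a 1 in a column zeroes all earlier columns in that row."""
    out = df.copy()
    for pos, col in enumerate(columns):
        if pos == 0:
            continue  # the first column has nothing before it to reset
        # One .loc call selects rows and columns together, so the
        # assignment writes into `out` rather than into a temporary copy.
        out.loc[out[col] == 1, columns[:pos]] = 0
    return out

df = pd.DataFrame({"X": [1, 0, 1], "Y": [0, 1, 1], "Z": [1, 0, 1]})
result = zero_lower_priority(df, ["X", "Y", "Z"])
print(result)
```

Working on a copy keeps the original frame available for comparison; drop the `df.copy()` if modifying in place is preferred.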

117

answered Oct 19 '22 22:10

jezrael

You can generalize the rule: in each row, keep only the last 1 and set everything else to 0. For performance, operate on the underlying NumPy array:

import numpy as np

a = df.values
# column index of the last 1 in each row (the first 1 in the reversed row)
idx = (a.shape[1] - a[:, ::-1].argmax(1)) - 1
t = np.zeros(a.shape)
t[np.arange(a.shape[0]), idx] = 1

array([[0., 0., 1.],
       [0., 1., 0.],
       [0., 0., 1.]])
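To see why the index arithmetic picks the last 1 in each row, it can help to print the intermediate arrays. A small walk-through on the question's example data; the variable names `reversed_first` and `last_one` are illustrative only:

```python
import numpy as np

a = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 1, 1]])

# argmax returns the position of the first maximum, so on the
# column-reversed array it finds the last 1 of each original row.
reversed_first = a[:, ::-1].argmax(axis=1)

# Map the position in the reversed array back to the original column index.
last_one = a.shape[1] - 1 - reversed_first

print(reversed_first)  # [0 1 0]
print(last_one)        # [2 1 2]
```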

If you need the result back as a DataFrame:

pd.DataFrame(t, columns=df.columns, index=df.index).astype(int)

   X  Y  Z
0  0  0  1
1  0  1  0
2  0  0  1
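One caveat worth checking against the real dataset: for a row that contains no 1 at all, argmax returns 0 on the reversed row, so the NumPy code above would place a 1 in the last column. If such rows can occur, a hedged variant is to skip them. The function name `keep_last_one` is hypothetical, and the sketch assumes the frame holds only 0/1 values:

```python
import numpy as np
import pandas as pd

def keep_last_one(df):
    """Keep only the last 1 in each row; rows without any 1 stay all zero."""
    a = df.to_numpy()
    idx = a.shape[1] - 1 - a[:, ::-1].argmax(axis=1)
    t = np.zeros(a.shape, dtype=int)
    has_one = a.any(axis=1)  # True for rows that contain at least one 1
    rows = np.arange(a.shape[0])[has_one]
    t[rows, idx[has_one]] = 1
    return pd.DataFrame(t, columns=df.columns, index=df.index)

# The third row is all zeros and should remain all zeros.
df = pd.DataFrame({"X": [1, 0, 0], "Y": [0, 1, 0], "Z": [1, 0, 0]})
print(keep_last_one(df))
```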

answered Oct 19 '22 21:10

user3483203
