
vectorize conditional assignment in pandas dataframe

If I have a dataframe df with column x and want to create column y based on values of x using this in pseudo code:

if df['x'] < -2 then df['y'] = 1
else if df['x'] > 2 then df['y'] = -1
else df['y'] = 0

How would I achieve this? I assume np.where is the best way to do it, but I'm not sure how to code it correctly.

azuric asked Mar 06 '15 10:03



2 Answers

One simple method would be to assign the default value first and then perform 2 loc calls:

In [66]:
df = pd.DataFrame({'x':[0,-3,5,-1,1]})
df

Out[66]:
   x
0  0
1 -3
2  5
3 -1
4  1

In [69]:
df['y'] = 0
df.loc[df['x'] < -2, 'y'] = 1
df.loc[df['x'] > 2, 'y'] = -1
df

Out[69]:
   x  y
0  0  0
1 -3  1
2  5 -1
3 -1  0
4  1  0

If you wanted to use np.where then you could do it with a nested np.where:

In [77]:
df['y'] = np.where(df['x'] < -2 , 1, np.where(df['x'] > 2, -1, 0))
df

Out[77]:
   x  y
0  0  0
1 -3  1
2  5 -1
3 -1  0
4  1  0

Here the outer np.where tests the first condition: where x is less than -2 it returns 1. Everywhere else it falls through to the inner np.where, which tests the second condition and returns -1 where x is greater than 2, and 0 otherwise.
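Nested np.where calls become hard to read once there are more than two conditions. As a hedged alternative that is not part of the original answer, NumPy's np.select takes a list of conditions and a matching list of values. The sketch below assumes the same sample data as above and checks that it agrees with the nested form.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [0, -3, 5, -1, 1]})

# Conditions are evaluated in order; for each row the first match wins.
conditions = [df['x'] < -2, df['x'] > 2]
choices = [1, -1]

# default=0 covers rows where neither condition holds.
df['y'] = np.select(conditions, choices, default=0)

# The nested np.where form from the answer, for comparison.
nested = np.where(df['x'] < -2, 1, np.where(df['x'] > 2, -1, 0))

print(df)
```

Adding a third or fourth branch then only means appending to both lists rather than nesting another call.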

timings

In [79]:
%timeit df['y'] = np.where(df['x'] < -2 , 1, np.where(df['x'] > 2, -1, 0))

1000 loops, best of 3: 1.79 ms per loop

In [81]:
%%timeit
df['y'] = 0
df.loc[df['x'] < -2, 'y'] = 1
df.loc[df['x'] > 2, 'y'] = -1

100 loops, best of 3: 3.27 ms per loop

So for this sample dataset the np.where method is roughly twice as fast (1.79 ms versus 3.27 ms per loop).
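The timings above come from a five-row frame, so they mostly measure fixed overhead. To repeat the comparison outside IPython on something larger, a minimal sketch with the standard timeit module might look like the following. The frame size, the random data, and the function names with_where and with_loc are assumptions made for illustration; the numbers will vary by machine and library version, so none are shown here.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical test data: 100,000 random integers between -5 and 5.
rng = np.random.default_rng(0)
df = pd.DataFrame({'x': rng.integers(-5, 6, size=100_000)})


def with_where():
    # Nested np.where, as in the answer.
    df['y'] = np.where(df['x'] < -2, 1, np.where(df['x'] > 2, -1, 0))


def with_loc():
    # Default value first, then two boolean-mask assignments.
    df['y'] = 0
    df.loc[df['x'] < -2, 'y'] = 1
    df.loc[df['x'] > 2, 'y'] = -1


runs = 20
t_where = timeit.timeit(with_where, number=runs)
t_loc = timeit.timeit(with_loc, number=runs)

print(f"np.where: {t_where / runs * 1e3:.3f} ms per call")
print(f"loc:      {t_loc / runs * 1e3:.3f} ms per call")
```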

EdChum answered Sep 28 '22 02:09


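This is a good use case for pd.cut, where you define bin edges and assign a label to each bin. Note that with right=False the bins are [-inf, -2), [-2, 2) and [2, inf), so a value of exactly 2 is labelled -1 rather than 0 as in the question's pseudocode, and the resulting column is categorical rather than integer: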

df['y'] = pd.cut(df['x'], [-np.inf, -2, 2, np.inf], labels=[1, 0, -1], right=False) 

Output

   x  y
0  0  0
1 -3  1
2  5 -1
3 -1  0
4  1  0
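The sample data never hits a bin edge, so it cannot show how pd.cut treats x == -2 or x == 2. The sketch below uses a hypothetical frame that includes both edges and compares the pd.cut result with the nested np.where form. The nudged upper edge built with np.nextafter is one possible workaround and an assumption of this sketch, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical data that includes both boundary values, -2 and 2.
df = pd.DataFrame({'x': [-3, -2, 0, 2, 5]})

# As in the answer: bins are [-inf, -2), [-2, 2), [2, inf).
cut = pd.cut(df['x'], [-np.inf, -2, 2, np.inf],
             labels=[1, 0, -1], right=False)
df['y_cut'] = cut.astype(int)  # categorical -> plain integers

# Reference: the question's rule (strict < -2 and strict > 2).
df['y_ref'] = np.where(df['x'] < -2, 1, np.where(df['x'] > 2, -1, 0))

# Possible workaround: move the upper edge to the next float above 2,
# so that exactly 2 stays in the middle bin.
upper = np.nextafter(2.0, np.inf)
cut_fixed = pd.cut(df['x'], [-np.inf, -2, upper, np.inf],
                   labels=[1, 0, -1], right=False)
df['y_cut_fixed'] = cut_fixed.astype(int)

print(df)
```

Here y_cut and y_ref differ only in the row where x is 2, while y_cut_fixed matches y_ref on every row.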
Erfan answered Sep 28 '22 02:09