pandas equivalent of np.where

np.where has the semantics of a vectorized if/else (similar to Apache Spark's when/otherwise DataFrame method). I know that I can use np.where on pandas.Series, but pandas often defines its own API to use instead of raw numpy functions, which is usually more convenient with pd.Series/pd.DataFrame.

Sure enough, I found pandas.DataFrame.where. However, at first glance, it has completely different semantics. I could not find a way to rewrite the most basic example of np.where using pandas where:

# df is a pd.DataFrame
# how to write this using df.where?
df['C'] = np.where((df['A']<0) | (df['B']>0), df['A']+df['B'], df['A']/df['B'])

Am I missing something obvious? Or is pandas' where intended for a completely different use case, despite having the same name as np.where?

max asked Jul 26 '16 00:07


1 Answer

Try:

(df['A'] + df['B']).where((df['A'] < 0) | (df['B'] > 0), df['A'] / df['B']) 

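The difference between numpy's where and pandas' where is that, in pandas, the values used where the condition is True come from the object the where method is called on (docs). The optional second argument, other, supplies the values used where the condition is False.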
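To make that concrete, here is a minimal sketch on a toy Series (the names s, mask, kept and filled are made up for illustration). It shows that values are kept where the condition holds, and that positions where it does not hold become NaN unless other is given:

```python
import pandas as pd

# Toy data, purely illustrative.
s = pd.Series([1.0, -2.0, 3.0])
mask = s > 0

# Values of s are kept where mask is True.
# With no second argument, the remaining positions become NaN.
kept = s.where(mask)

# With a second argument (other), the remaining positions take that value.
filled = s.where(mask, 0.0)

print(kept)
print(filled)
```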

I.e.

np.where(m, A, B) 

is roughly equivalent to

A.where(m, B) 
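A quick way to check this on the example from the question, assuming a small made-up DataFrame with float columns A and B. One reason the equivalence is only rough: np.where returns a plain ndarray, while the pandas method returns a Series that keeps the index.

```python
import numpy as np
import pandas as pd

# Made-up sample data for illustration.
df = pd.DataFrame({"A": [-1.0, 2.0, 3.0], "B": [4.0, -5.0, 6.0]})
cond = (df["A"] < 0) | (df["B"] > 0)

# numpy form: condition first, then the "if true" and "if false" values.
via_numpy = np.where(cond, df["A"] + df["B"], df["A"] / df["B"])

# pandas form: the "if true" values are the object the method is called on.
via_pandas = (df["A"] + df["B"]).where(cond, df["A"] / df["B"])

# Same values, different container types.
print(type(via_numpy).__name__, type(via_pandas).__name__)
print(np.allclose(via_numpy, via_pandas.to_numpy()))
```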

If you want a call signature closer to numpy's using pandas, you can take advantage of the way method calls work in Python and pass self explicitly. Since df['A'] + df['B'] is a Series, pd.Series is the matching class to call the method through:

pd.Series.where(cond=(df['A'] < 0) | (df['B'] > 0), self=df['A'] + df['B'], other=df['A'] / df['B'])

or without kwargs (note that the positional order of the arguments differs from numpy's where):

pd.Series.where(df['A'] + df['B'], (df['A'] < 0) | (df['B'] > 0), df['A'] / df['B'])
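
If the numpy argument order is what you are after, another option is a tiny wrapper. This is only a sketch: np_style_where is a hypothetical helper name, not part of pandas, and the sample data is made up.

```python
import pandas as pd

def np_style_where(cond, if_true, if_false):
    """Hypothetical helper (not a pandas API): np.where argument order,
    but the result is a pandas object that keeps its index."""
    return if_true.where(cond, if_false)

# Made-up sample data for illustration.
df = pd.DataFrame({"A": [-1.0, 2.0, 3.0], "B": [4.0, -5.0, 6.0]})

df["C"] = np_style_where(
    (df["A"] < 0) | (df["B"] > 0),  # condition
    df["A"] + df["B"],              # used where the condition is True
    df["A"] / df["B"],              # used where the condition is False
)
print(df)
```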

Alex answered Sep 21 '22 07:09