
Keeping NaNs with pandas dataframe inequalities

Tags: python, pandas

I have a pandas.DataFrame object that contains about 100 columns and 200000 rows of data. I am trying to convert it to a bool dataframe where True means that the value is greater than or equal to the threshold, False means that it is less than the threshold, and NaN values are preserved.

If there are no NaN values, it takes about 60 ms for me to run:

df >= threshold

When I try to preserve the NaNs, the method below works but is very slow (about 20 seconds).

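def func(x):
    # NaN fails both comparisons, so it falls through and is returned unchanged.
    if x >= threshold:
        return True
    elif x < threshold:
        return False
    else:
        return x

# Calls a Python function on every cell: correct, but slow.
df.apply(lambda col: col.apply(func))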

Is there a faster way?

jsignell asked Nov 18 '15 14:11




1 Answer

You can do:

import numpy as np

new_df = df >= threshold
new_df[df.isnull()] = np.nan

But that differs from what the apply method returns. Here the mask has float dtype containing NaN, 0.0 and 1.0, whereas the apply solution gives object dtype containing NaN, False and True.
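As a minimal sketch of the float-valued variant, the comparison can be cast to float explicitly before the NaNs are restored, so the result does not depend on how a particular pandas version upcasts a boolean frame. The sample frame, the threshold value, and the name flags are assumptions made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the real frame has ~100 columns and 200000 rows.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 0.5, 2.0]})
threshold = 2.0

# Vectorized comparison: NaN cells compare as False at this point.
above = df >= threshold

# Cast to float so the frame can hold NaN, then put NaN back
# wherever the original value was missing.
flags = above.astype(float).mask(df.isna())

print(flags)
```

Every step here is a whole-frame operation, which is why it avoids the per-cell cost of the nested apply.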

Neither is safe to use as a mask, because you might not get what you expect. IEEE 754 specifies that ordered comparisons involving NaN (such as >= and <) evaluate to False, and the apply method implicitly violates that by returning NaN!
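A short sketch of why a NaN-bearing mask can surprise: ordered comparisons with NaN are False, yet NaN itself is truthy, so converting a float mask back to booleans turns the missing entries into True. The variable names are illustrative only.

```python
import numpy as np

nan = float("nan")

# Ordered comparisons with NaN evaluate to False.
ge = nan >= 1.0
lt = nan < 1.0

# NaN is not equal to anything, including itself.
ne = nan != nan

# NaN is non-zero, so it is truthy when converted to bool.
truthy = bool(nan)

# A float mask holding NaN, 0.0 and 1.0 becomes True, False, True.
cast = np.array([np.nan, 0.0, 1.0]).astype(bool)

print(ge, lt, ne, truthy, cast.tolist())
```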

The best option is to keep the comparison result as a plain boolean frame and track the NaNs separately with df.isnull(), which is quite fast when bottleneck is installed.
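One way this could look, as a sketch rather than a definitive recipe: keep two plain boolean frames, one for the comparison and one for the missing values, and combine them only when a specific question needs both. The sample data and the names above, missing and below are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for the real frame.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 0.5, 2.0]})
threshold = 2.0

# Plain boolean frame: True where the value meets the threshold.
above = df >= threshold

# Separate boolean frame recording where the input was missing.
missing = df.isnull()

# Cells that are present and strictly below the threshold.
below = ~above & ~missing

# Each cell falls into exactly one of the three categories.
print(above.sum().sum(), below.sum().sum(), missing.sum().sum())
```

Both frames keep bool dtype, so they can be used directly for indexing without any ambiguity about how NaN is treated.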

ocefpaf answered Oct 06 '22 00:10
