I have a pandas.DataFrame object that contains about 100 columns and 200,000 rows of data. I am trying to convert it to a bool DataFrame where True means that the value is greater than or equal to the threshold, False means that it is less, and NaN values are preserved.
If there are no NaN values, it takes about 60 ms for me to run:
df >= threshold
But when I try to preserve the NaNs, the method below works but is very slow (about 20 sec):
def func(x):
    if x >= threshold:
        return True
    elif x < threshold:
        return False
    else:
        return x
df.apply(lambda x: x.apply(lambda x: func(x)))
Is there a faster way?
You can either drop rows with NaN values using pandas.DataFrame.dropna() or fill them with a specified value (0, an empty string, etc.) using the fillna() method.
The default value of the skipna parameter is True, so calling sum() without passing skipna skips all NaN values.
You can do:
new_df = df >= threshold
new_df[df.isnull()] = np.nan
But that is different from what you will get using the apply method. Here your mask has float dtype containing NaN, 0.0 and 1.0. In the apply solution you get object dtype with NaN, False, and True.
Neither is safe to use as a mask, because you might not get what you want. IEEE 754 specifies that an ordered comparison involving NaN must yield False, and the apply method implicitly violates that by returning NaN!
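To illustrate why a NaN inside a mask is risky, here is a small sketch in plain Python. The threshold value is made up for the example. Every ordered comparison with NaN is False, yet NaN itself is truthy when converted to bool, so a mask that contains NaN can silently select cells you meant to exclude.

```python
nan = float("nan")
threshold = 1.0  # hypothetical value, chosen only for illustration

ge = nan >= threshold   # ordered comparison with NaN
lt = nan < threshold    # the opposite comparison is also False
eq = nan == nan         # NaN is not even equal to itself
truthy = bool(nan)      # but NaN converts to True, not False

print(ge, lt, eq, truthy)
```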
The best option is to keep track of the NaNs separately, and df.isnull() is quite fast when bottleneck is installed.
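A minimal sketch of keeping the NaNs in a separate mask might look like this. The small DataFrame, the threshold, and the names above, missing, and below are assumptions made up for the example, not part of the original question.

```python
import numpy as np
import pandas as pd

threshold = 0.5  # hypothetical threshold
df = pd.DataFrame({"a": [0.1, 0.7, np.nan], "b": [np.nan, 0.5, 0.9]})

# Plain bool mask: NaN cells compare as False here.
above = df >= threshold

# Separate bool mask that records where the data was missing.
missing = df.isnull()

# Cells that are genuinely below the threshold (not merely missing).
below = ~above & ~missing
```

Both masks stay bool dtype, so they can be combined with &, | and ~ and used for indexing without any NaN surprises.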