I have a pandas.DataFrame object that contains about 100 columns and 200000 rows of data. I am trying to convert it to a bool dataframe where True means that the value is greater than the threshold, False means that it is less, and NaN values are maintained.
If there are no NaN values, it takes about 60 ms for me to run:
df >= threshold
But when I try to deal with the NaNs, the below method works, but is very slow (20 sec).
def func(x):
    if x >= threshold:
        return True
    elif x < threshold:
        return False
    else:
        return x

df.apply(lambda col: col.apply(func))
Is there a faster way?
You can do:
new_df = df >= threshold
new_df[df.isnull()] = np.nan  # put NaN back at the originally-missing positions
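For illustration, here is a minimal self-contained sketch of that two-step approach (the tiny DataFrame and the threshold value are invented for the example):

```python
import numpy as np
import pandas as pd

threshold = 0.5
df = pd.DataFrame({"a": [0.2, 0.8, np.nan],
                   "b": [np.nan, 0.1, 0.9]})

new_df = df >= threshold       # NaN >= threshold evaluates to False (IEEE 754)
new_df[df.isnull()] = np.nan   # restore NaN where the input was missing
```

Note that the second assignment forces an upcast away from bool dtype, since a bool column cannot hold NaN.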
But this is different from what you get with the apply method: here the mask has float dtype, containing NaN, 0.0, and 1.0, whereas the apply solution gives you object dtype containing NaN, False, and True.
Neither is safe to use directly as a mask, because you might not get what you expect. IEEE 754 says that any comparison with NaN must yield False, and the apply method implicitly violates that by returning NaN instead!
The best option is to keep track of the NaNs separately; df.isnull() is quite fast, especially when bottleneck is installed.
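As a sketch of that last suggestion (again with an invented DataFrame and threshold), you can carry a plain bool comparison plus a separate missing-value mask and combine them as needed:

```python
import numpy as np
import pandas as pd

threshold = 0.5
df = pd.DataFrame({"a": [0.2, 0.8, np.nan]})

above = df >= threshold    # bool dtype; NaN positions compare as False
missing = df.isnull()      # bool dtype; True where the input is NaN
below = ~above & ~missing  # strictly below threshold and not missing
```

Both `above` and `missing` stay bool dtype, so they remain valid masks and boolean algebra on them stays fast.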