pd.NA vs np.nan for pandas

Tags: python, pandas, dataframe, numpy

pd.NA vs np.nan for pandas. Which one should I use with pandas, and why? What are the main advantages and disadvantages of each of them?

Some sample code that uses them both:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'object': ['a', 'b', 'c', pd.NA],
    'numeric': [1, 2, np.nan, 4],
    'categorical': pd.Categorical(['d', np.nan, 'f', 'g']),
})

Output:

|    | object   |   numeric | categorical   |
|---:|:---------|----------:|:--------------|
|  0 | a        |         1 | d             |
|  1 | b        |         2 | nan           |
|  2 | c        |       nan | f             |
|  3 | <NA>     |         4 | g             |

996

asked Feb 07 '20 14:02

vasili111

1 Answer

As of now (the pandas 1.0.0 release), I would recommend using pd.NA carefully.

First, it's still an experimental feature:

Experimental: the behaviour of pd.NA can still change without warning.

Second, the behaviour differs from np.nan:

Compared to np.nan, pd.NA behaves differently in certain operations. In addition to arithmetic operations, pd.NA also propagates as “missing” or “unknown” in comparison operations.

Both quotes are from the release notes.
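A minimal sketch of the difference the release notes describe, using only scalar operations. The variable names are made up for illustration, and since pd.NA is marked experimental, the behaviour could differ in other pandas versions:

```python
import numpy as np
import pandas as pd

# Arithmetic: both kinds of missing value propagate.
nan_sum = np.nan + 1   # nan
na_sum = pd.NA + 1     # <NA>

# Comparison: np.nan yields a plain bool, pd.NA stays "unknown".
nan_eq = np.nan == 1   # False
na_eq = pd.NA == 1     # <NA>

# pd.isna recognises both, so it is a safe way to test for missing values.
both_missing = pd.isna(np.nan) and pd.isna(pd.NA)

print(nan_sum, na_sum, nan_eq, na_eq, both_missing)
```

Because a comparison with pd.NA returns pd.NA rather than True or False, filters that rely on comparison results may need an explicit pd.isna check.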

As an additional example, the interpolation behaviour surprised me.

Create a simple DataFrame:

df = pd.DataFrame({"a": [0, pd.NA, 2], "b": [0, np.nan, 2]})
df
#       a    b
# 0     0  0.0
# 1  <NA>  NaN
# 2     2  2.0

Then try to interpolate:

df.interpolate()
#       a    b
# 0     0  0.0
# 1  <NA>  1.0
# 2     2  2.0

There are reasons for this (I am still looking into them). For now, I just want to highlight these differences: pd.NA is an experimental feature, and it behaves differently from np.nan in some cases.
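One likely contributing factor, which the answer itself does not confirm, is the column dtype: a plain list that mixes integers with pd.NA becomes an object column, while np.nan lets pandas upcast to float64. The sketch below checks the dtypes and then shows one possible workaround. The route through the nullable "Int64" dtype is an assumption chosen for illustration, not a recommendation from the release notes:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [0, pd.NA, 2], "b": [0, np.nan, 2]})

# Column "a" holds Python objects; column "b" is a float column.
print(df.dtypes)

# Possible workaround (assumption): convert to nullable Int64, then
# export to float64 with NaN as the missing marker before interpolating.
a_float = pd.Series(
    df["a"].astype("Int64").to_numpy(dtype="float64", na_value=np.nan),
    index=df.index,
    name="a",
)
interpolated = a_float.interpolate()
print(interpolated)
```

Checking df.dtypes before calling numeric methods is a cheap way to notice when pd.NA has left a column as object.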

I think it will be a very useful feature, but I would be careful with statements like "It should be completely fine to use it instead of np.nan". That might be true in most cases, but it can cause trouble if you are not aware of the differences.
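As one concrete illustration of such trouble, here is a small sketch of code that branches on a comparison result. The helper is_positive is hypothetical and exists only for this example:

```python
import numpy as np
import pandas as pd


def is_positive(value):
    """Hypothetical helper that branches on a comparison result."""
    try:
        return "yes" if value > 0 else "no"
    except TypeError:
        # The truth value of pd.NA is ambiguous, so pandas raises
        # instead of silently picking True or False.
        return "unknown"


nan_result = is_positive(np.nan)  # np.nan > 0 is False
na_result = is_positive(pd.NA)    # pd.NA > 0 is <NA>, which cannot be used in `if`

print(nan_result, na_result)
```

Code written with np.nan in mind silently takes the "no" branch, while the same code fed pd.NA raises unless the missing case is handled explicitly.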

114

answered Sep 21 '22 19:09

Nerxis
