pd.NA vs np.nan for pandas

Which one should I use with pandas, pd.NA or np.nan, and why? What are the main advantages and disadvantages of each?

Some sample code that uses them both:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'object': ['a', 'b', 'c', pd.NA],
    'numeric': [1, 2, np.nan, 4],
    'categorical': pd.Categorical(['d', np.nan, 'f', 'g']),
})

Output:

|    | object   |   numeric | categorical   |
|---:|:---------|----------:|:--------------|
|  0 | a        |         1 | d             |
|  1 | b        |         2 | nan           |
|  2 | c        |       nan | f             |
|  3 | <NA>     |         4 | g             |
Asked by vasili111 on Feb 07 '20 at 14:02

1 Answer

As of now (the pandas 1.0.0 release), I would recommend using pd.NA with care.

First, it's still an experimental feature:

Experimental: the behaviour of pd.NA can still change without warning.

Second, the behaviour differs from np.nan:

Compared to np.nan, pd.NA behaves differently in certain operations. In addition to arithmetic operations, pd.NA also propagates as “missing” or “unknown” in comparison operations.

Both quotes are from the release notes.
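To make the second quote concrete, here is a minimal sketch of how the two values behave as scalars in comparisons and logical operations. The variable names are only for illustration, and since pd.NA is experimental, the details may differ in other pandas versions.

```python
import numpy as np
import pandas as pd

# np.nan follows IEEE 754: a comparison always produces a plain bool.
nan_eq = np.nan == 1          # False
nan_self = np.nan == np.nan   # False

# pd.NA propagates: the result of a comparison is itself pd.NA.
na_eq = pd.NA == 1            # <NA>
na_self = pd.NA == pd.NA      # <NA>

# Logical operations follow three-valued logic: the result is only
# known when it does not depend on the missing value.
na_or_true = pd.NA | True     # True
na_and_false = pd.NA & False  # False
na_or_false = pd.NA | False   # <NA>

# pd.NA has no truth value, so using it in an `if` raises TypeError.
try:
    bool(pd.NA)
    na_bool_error = None
except TypeError as exc:
    na_bool_error = exc

print(nan_eq, nan_self, na_eq, na_self)
print(na_or_true, na_and_false, na_or_false)
print(type(na_bool_error).__name__)
```

The practical consequence is that code such as `if value == 1:` keeps working silently with np.nan but fails loudly with pd.NA, so checks should go through pd.isna() instead.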

As an additional example, I was surprised by the interpolation behaviour.

Create a simple DataFrame:

df = pd.DataFrame({"a": [0, pd.NA, 2], "b": [0, np.nan, 2]})
df
#       a    b
# 0     0  0.0
# 1  <NA>  NaN
# 2     2  2.0

and try to interpolate:

df.interpolate()
#       a    b
# 0     0  0.0
# 1  <NA>  1.0
# 2     2  2.0

There are reasons for this (I am still working out what they are), but I just wanted to highlight these differences: it is an experimental feature, and it behaves differently in some cases.
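One likely contributor (my own reading, not something the release notes state) is the column dtype: without an explicit dtype, integers mixed with pd.NA end up in an object column, while integers mixed with np.nan are upcast to float64. The sketch below checks the dtypes and shows one possible workaround, converting through the nullable Int64 dtype to float64 before interpolating. Treat the workaround as an assumption to verify on your own pandas version.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [0, pd.NA, 2], "b": [0, np.nan, 2]})

# Ints mixed with pd.NA give an object column; ints mixed with np.nan
# are upcast to float64.
dtype_a = str(df["a"].dtype)
dtype_b = str(df["b"].dtype)

# Assumed workaround: cast to nullable Int64, then to float64. The
# second cast turns pd.NA into np.nan, which interpolate() can fill.
a_float = df["a"].astype("Int64").astype("float64")
a_interp = a_float.interpolate()

print(dtype_a, dtype_b)   # object float64
print(a_interp.tolist())  # [0.0, 1.0, 2.0]
```

Declaring the dtype up front, for example pd.Series([0, pd.NA, 2], dtype="Int64"), avoids the object column in the first place.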

I think it will be a very useful feature, but I would be careful with statements like "It should be completely fine to use it instead of np.nan". That may be true in most cases, but it can cause trouble if you are not aware of the differences.

Answered by Nerxis on Sep 21 '22 at 19:09