This is somewhat of a broad topic, but I will try to pare it down to some specific questions.
In starting to answer questions on SO, I have found myself sometimes running into a silly error like this when making toy data:
In[0]:
import pandas as pd
df = pd.DataFrame({"values":[1,2,3,4,5,6,7,8,9]})
df[df < 5] = np.nan
Out[0]:
NameError: name 'np' is not defined
I'm so used to automatically importing numpy with pandas that this doesn't usually occur in real code. However, it did make me wonder why pandas doesn't have its own value/object for representing null values.
I only recently realized that you could just use the Python None instead for a similar situation:
import pandas as pd
df = pd.DataFrame({"values":[1,2,3,4,5,6,7,8,9]})
df[df < 5] = None
This works as expected and doesn't produce an error. But the convention I have seen on SO is to use np.nan, and people are usually referring to np.nan when discussing null values (this is perhaps why I hadn't realized None can be used, but maybe that was my own idiosyncrasy).
Briefly looking into this, I have now seen that pandas has had a pandas.NA value since 1.0.0, but I have never seen anyone use it in a post:
In[0]:
import pandas as pd
import numpy as np
df = pd.DataFrame({'values':np.random.rand(20,)})
df['above'] = df['values']
df['below'] = df['values']
# use .loc rather than chained indexing so the assignment actually modifies df
df.loc[df['values'] > 0.7, 'above'] = np.nan
df.loc[df['values'] < 0.3, 'below'] = pd.NA
df['names'] = ['a','b','c','a','b','c','a','b','c','a']*2
df.loc[df['names']=='a','names'] = pd.NA
df.loc[df['names']=='b','names'] = np.nan
df.loc[df['names']=='c','names'] = None
df
Out[0]:
values above below names
0 0.323531 0.323531 0.323531 <NA>
1 0.690383 0.690383 0.690383 NaN
2 0.692371 0.692371 0.692371 None
3 0.259712 0.259712 NaN <NA>
4 0.473505 0.473505 0.473505 NaN
5 0.907751 NaN 0.907751 None
6 0.642596 0.642596 0.642596 <NA>
7 0.229420 0.229420 NaN NaN
8 0.576324 0.576324 0.576324 None
9 0.823715 NaN 0.823715 <NA>
10 0.210176 0.210176 NaN <NA>
11 0.629563 0.629563 0.629563 NaN
12 0.481969 0.481969 0.481969 None
13 0.400318 0.400318 0.400318 <NA>
14 0.582735 0.582735 0.582735 NaN
15 0.743162 NaN 0.743162 None
16 0.134903 0.134903 NaN <NA>
17 0.386366 0.386366 0.386366 NaN
18 0.313160 0.313160 0.313160 None
19 0.695956 0.695956 0.695956 <NA>
So it seems that for numerical values, the distinction between these different null values doesn't matter, but they are represented differently for strings (and perhaps for other data types?).
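As a quick check on that (a minimal sketch of my own), isna() flags all three markers as missing, even though they display differently in the object column:
import pandas as pd
import numpy as np
s = pd.Series(['a', np.nan, None, pd.NA])
s.isna()
#0    False
#1     True
#2     True
#3     True
#dtype: bool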
My questions based on the above:

1. Is it conventional to use np.nan (rather than None) to represent null values in pandas?
2. Why did pandas not have its own null value for most of its lifetime (until last year)? What was the motivation for adding it?
3. When you use np.nan, None, or pd.NA in a string Series or column, is there any difference between them? Why are they not represented identically (as with numerical data)?

I fully anticipate that I may have a flawed interpretation of things and of the distinction between pandas and numpy, so please correct me.
A main dependency of pandas is numpy; in other words, pandas is built on top of numpy. Because pandas inherits and uses many of the numpy methods, it makes sense to keep things consistent; that is, missing numeric data are represented with np.NaN.
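You can see this directly; the data behind a Series is a numpy array (a tiny sketch of mine, not from the original answer):
import pandas as pd
pd.Series([1, 2, 3]).to_numpy()
#array([1, 2, 3])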
(This choice to build upon numpy has consequences for other things too. For instance, date and time operations are built upon the np.timedelta64 and np.datetime64 dtypes, not the standard datetime module.)
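For example (a small sketch of mine), parsing dates yields the numpy-backed datetime64[ns] dtype, and missing timestamps get yet another null marker, NaT ("not a time"):
import pandas as pd
pd.to_datetime(pd.Series(['2020-01-01', None]))
#0   2020-01-01
#1          NaT   <- missing datetimes become NaT
#dtype: datetime64[ns]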
One thing you may not have known is that numpy has always been there with pandas:
import pandas as pd
pd.np?
pd.np.nan
Though you might think this is convenient since it saves you an import, it is discouraged, and in the near future it will be deprecated in favor of importing numpy directly:
FutureWarning: The pandas.np module is deprecated and will be removed from pandas in a future version. Import numpy directly instead
Is it conventional to use np.nan (rather than None) to represent null values in pandas?
If the data are numeric then yes, you should use np.NaN. None requires the dtype to be object, and with pandas you want numeric data stored in a numeric dtype. pandas will generally coerce to the proper null-type upon creation or import so that it can use the correct dtype:
pd.Series([1, None])
#0 1.0
#1 NaN <- None became NaN so it can have dtype: float64
#dtype: float64
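Conversely, you can force None to survive by requesting object dtype, but then the values are stored as generic Python objects rather than in a fast numeric array (a quick sketch of my own to illustrate the cost):
pd.Series([1, None], dtype=object)
#0       1
#1    None  <- no coercion, but numeric operations lose their speed
#dtype: object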
Why did pandas not have its own null value for most of its lifetime (until last year)? What was the motivation for adding it?
pandas did not have its own null value because it got by with np.NaN, which worked for the majority of circumstances. However, with pandas it's very common to have missing data; an entire section of the documentation is devoted to this. NaN, being a float, does not fit into an integer container, which means that any numeric Series with missing data is upcast to float. This can become problematic because of floating point math: some integers cannot be represented perfectly by a floating point number, so any joins or merges could possibly fail.
# Gets upcast to float
pd.Series([1,2,np.NaN])
#0 1.0
#1 2.0
#2 NaN
#dtype: float64
# Can safely do merges/joins/math because things are still Int
pd.Series([1,2,np.NaN]).astype('Int64')
#0 1
#1 2
#2 <NA>
#dtype: Int64
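To make the floating point hazard concrete, here is a sketch of my own (not from the original answer): two distinct integer keys collide once they are upcast to float64, which is exactly how a merge on those keys could silently go wrong, while the nullable Int64 dtype keeps them distinct:
import pandas as pd
import numpy as np

big = 2 ** 53
# Upcast to float64: 2**53 + 1 has no exact float64 representation
s = pd.Series([big, big + 1, np.NaN])
s[0] == s[1]
#True  <- two different keys now look identical

# Nullable Int64 keeps the integers exact alongside <NA>
s2 = pd.Series([big, big + 1, pd.NA], dtype='Int64')
s2[0] == s2[1]
#False <- distinct keys preserved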
Consider, for example, a filter-function that returns only one value, let's say None, when something goes wrong in numpy calculations or so on. So the pandas NaN means something different. Maybe it does not make sense here in your special case, but it will have a meaning in other cases.

That's a great question! My hunch is that this has to do with the fact that NumPy functions are implemented in C, which makes them so fast. Python's None might not give you the same efficiency (or is probably translated into np.nan), while Pandas's pd.NA would likely be translated into NumPy's np.nan anyway, since Pandas requires NumPy. I haven't found resources to support my claims yet, though.