
nanfunctions and regular functions behaving the same on Pandas type

Normally numpy.var() and numpy.nanvar() give different results when there are missing values, and the same goes for numpy.std() and numpy.nanstd(). However:

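import numpy as np
import pandas as pd

# np.nan is used here because the np.NaN alias was removed in NumPy 2.0
df = pd.DataFrame({'A': [1,2,3,4,5,6,7,8,9,10,np.nan,np.nan,np.nan]})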

print("np.var() " + " : "+ str(np.var(df["A"])))
print("np.nanvar() " + " : "+ str(np.nanvar(df["A"])))
print("np.std() " + " : "+ str(np.std(df["A"])))
print("np.nanstd() " + " : "+ str(np.nanstd(df["A"])))

Results:

np.var() : 8.25
np.nanvar() : 8.25
np.std() : 2.8722813232690143
np.nanstd() : 2.8722813232690143

Why are both the same? There is nothing about missing values in the documentation of np.var() or np.std().

Chris asked Apr 15 '18 20:04


1 Answer

This is because numpy.std (resp. numpy.var) tries to delegate to the first argument's std (resp. var) method if the argument isn't an ndarray (from the NumPy source):

def std(a, axis=None, dtype=None, out=None, ddof=0, keepdims=np._NoValue):
    kwargs = {}
    if keepdims is not np._NoValue:
        kwargs['keepdims'] = keepdims

    if type(a) is not mu.ndarray:
        try:
            std = a.std
        except AttributeError:
            pass
        else:
            return std(axis=axis, dtype=dtype, out=out, ddof=ddof, **kwargs)

    return _methods._std(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
                         **kwargs)

So really, you're just calling pandas.Series.std with ddof=0 (the population standard deviation) rather than the pandas default of ddof=1. And in pandas, all of the descriptive statistics methods skip missing values by default (see "Calculations with missing data" in the docs).
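The delegation itself can be seen without pandas. Below is a minimal sketch using a hypothetical class, Tagged, which is not part of NumPy or pandas and exists only for illustration. It assumes np.std still contains the delegation branch quoted above.

```python
import numpy as np


class Tagged:
    """Hypothetical non-ndarray object that only reports how std was called."""

    def std(self, axis=None, dtype=None, out=None, ddof=0, **kwargs):
        # np.std forwards axis, dtype, out and ddof as keyword arguments.
        return "Tagged.std called with ddof=%d" % ddof


# type(Tagged()) is not ndarray, so np.std hands the call to Tagged.std
result = np.std(Tagged())
print(result)
```

No numerical work happens inside NumPy here; the return value is whatever the object's own std method produces.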

The takeaway here is that it is much clearer to use the pandas methods instead of the NumPy free functions in the first place when you have a pandas Series.
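As a rough illustration of that takeaway, the sketch below calls the Series methods directly so that both the NaN handling and the ddof choice are visible at the call site. The variable names are only for this example.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, np.nan, np.nan, np.nan])

# Population std, NaNs skipped: this is what np.std(s) ends up computing.
population_std = s.std(ddof=0)

# pandas default is the sample std (ddof=1), still skipping NaNs.
sample_std = s.std()

# Opting out of NaN skipping propagates NaN.
strict_std = s.std(skipna=False)

# On a plain ndarray there is no delegation, so NaN propagates too.
raw_numpy_std = np.std(s.to_numpy())

print(population_std, sample_std, strict_std, raw_numpy_std)
```

Converting with to_numpy() first is one way to get NumPy's own NaN-propagating behavior when that is what you want.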


Comments

This is what NumPy does for many functions that take an array-like object as their first argument: try to use the method of the same name on the object if it exists, and otherwise use a fallback. The function and the method don't always behave identically, though. For instance, np.sort returns a sorted copy while ndarray.sort sorts in place:

>>> a = np.random.randint(0, 100, 5)

>>> a
array([49, 68, 93, 51, 94])

>>> np.sort(a) # not in-place
array([49, 51, 68, 93, 94])

>>> a
array([49, 68, 93, 51, 94])

>>> a.sort() # in-place

>>> a
array([49, 51, 68, 93, 94])

Also, in most cases the NaN-handling functions in nanfunctions.py first call _replace_nan, which converts the input to an ndarray and replaces the NaN values with a value that won't affect the calculation being performed (e.g. np.nansum replaces NaNs with 0, np.nanprod replaces NaNs with 1). They then call their non-NaN counterparts to perform the actual calculation, as np.nansum does:

def nansum(a, axis=None, dtype=None, out=None, keepdims=np._NoValue):
    a, mask = _replace_nan(a, 0)
    return np.sum(a, axis=axis, dtype=dtype, out=out, keepdims=keepdims)

So if you call np.nansum on a pandas Series, for instance, you don't actually end up using pandas.Series.sum, because the Series is converted to an ndarray first inside _replace_nan. So don't assume or rely on the sum method of your Series being called (I'm not sure why you would).

# a silly example

>>> s = pd.Series([1, 2, 3, np.nan])

>>> s.sum = lambda *args, **kwargs: "instance sum"

>>> s.sum()
'instance sum'

>>> np.sum(s)
'instance sum'

>>> np.nansum(s)
6.0
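
The NaN-replacement step described above can be sketched roughly as follows. nansum_sketch is a hypothetical helper written for this example, not NumPy's actual _replace_nan, and it assumes float input.

```python
import numpy as np


def nansum_sketch(values):
    """Illustrative only: mimic the replace-then-reduce idea behind np.nansum."""
    # Converting to a plain ndarray drops any Series methods from the picture.
    arr = np.asarray(values, dtype=float)
    # Swap each NaN for 0, the value that does not change a sum.
    filled = np.where(np.isnan(arr), 0.0, arr)
    return np.sum(filled)


total = nansum_sketch([1, 2, 3, np.nan])
print(total)
```

Because the input becomes an ndarray before np.sum is called, the delegation branch is never reached, which matches the np.nansum(s) result in the silly example.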
miradulo answered Oct 15 '22 13:10