
np.isnan on arrays of dtype "object"

Tags:

python

numpy

I'm working with numpy arrays of different data types. I would like to know, of any particular array, which elements are NaN. Normally, this is what np.isnan is for.

However, np.isnan isn't friendly to arrays of data type object (or any string data type):

>>> str_arr = np.array(["A", "B", "C"])
>>> np.isnan(str_arr)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: Not implemented for this type

>>> obj_arr = np.array([1, 2, "A"], dtype=object)
>>> np.isnan(obj_arr)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

What I would like to get out of these two calls is simply np.array([False, False, False]). I can't just put try and except TypeError around my call to np.isnan and assume that any array that raises a TypeError contains no NaNs: after all, I'd like np.isnan(np.array([1, np.nan, "A"], dtype=object)) to return np.array([False, True, False]).

My current solution is to make a new array of type np.float64, loop through the elements of the original array, try to put each element in the new array (leaving it as zero if that fails), and then call np.isnan on the new array. However, this is of course rather slow, at least for large object arrays.

def isnan(arr):
    if isinstance(arr, np.ndarray) and (arr.dtype == object):
        # Create a new array of dtype float64, fill it with the same values as the input array (where possible), and
        # then call np.isnan on the new array. This way, np.isnan is only called once. (Much faster than calling it on
        # every element in the input array.)
        new_arr = np.zeros((len(arr),), dtype=np.float64)
        for idx in xrange(len(arr)):
            try:
                new_arr[idx] = arr[idx]
            except Exception:
                pass
        return np.isnan(new_arr)
    else:
        try:
            return np.isnan(arr)
        except TypeError:
            return False

This particular implementation also only works for one-dimensional arrays, and I can't think of a decent way to make the for loop run over an arbitrary number of dimensions.

Is there a more efficient way to figure out which elements in an object-type array are NaN?

EDIT: I'm running Python 2.7.10.

Note that [x is np.nan for x in np.array([np.nan])] returns [False]: np.nan is not always the same object in memory as a different np.nan.

I do not want the string "nan" to be considered equivalent to np.nan: I want isnan(np.array(["nan"], dtype=object)) to return np.array([False]).

The multi-dimensionality isn't a big issue. (It's nothing that a little ravel-and-reshaping won't fix. :p)

Any function that relies on the is operator to test equivalence of two NaNs isn't always going to work. (If you think it should, ask yourself what the is operator actually does!)
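To make the point about the is operator concrete, here is a small sketch. It assumes Python 3 and a recent NumPy (the question itself is on Python 2.7), and the variable names are only illustrative.

```python
import math
import numpy as np

# An object array stores references to the exact objects it was given,
# so only the element that really is the np.nan singleton passes "is".
obj_arr = np.array([np.nan, float("nan")], dtype=object)
identity_obj = [x is np.nan for x in obj_arr]

# A float64 array stores raw values; iterating builds a fresh np.float64
# scalar for each element, so "is np.nan" never matches.
float_arr = np.array([np.nan, np.nan])
identity_float = [x is np.nan for x in float_arr]

# A value-based test does not care which object holds the NaN.
by_value = [isinstance(x, float) and math.isnan(x) for x in obj_arr]

print(identity_obj)    # [True, False]
print(identity_float)  # [False, False]
print(by_value)        # [True, True]
```

The identity test only succeeds when the array happens to hold the np.nan object itself, which is why it looks reliable in small experiments and then fails on NaNs produced elsewhere.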

asked Mar 24 '16 10:03 by acdr


2 Answers

If you are willing to use the pandas library, a handy function that covers this case is pd.isnull:

pandas.isnull(obj)

Detect missing values (NaN in numeric arrays, None/NaN in object arrays)

Here is an example:

$ python
>>> import numpy   
>>> import pandas
>>> array = numpy.asarray(['a', float('nan')], dtype=object)
>>> pandas.isnull(array)
array([False,  True])
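As a further sketch (assuming Python 3 with pandas installed; the array contents are made up for illustration), the same call also handles a two-dimensional object array and does not treat the string "nan" as missing. Note that it does treat None as missing, which may or may not be what you want.

```python
import numpy as np
import pandas as pd

# 2-D object array mixing an int, a real NaN, the string "nan", and None
arr = np.array([[1, np.nan], ["nan", None]], dtype=object)

# pd.isnull returns a boolean ndarray with the same shape as the input
mask = pd.isnull(arr)

print(mask.tolist())  # [[False, True], [False, True]]
```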
answered Nov 13 '22 09:11 by jII


You could just use a list comprehension to get the indexes of any NaNs, which may be faster in this case. (Note that comparing str(n) to "nan" also matches the literal string "nan".)

obj_arr = np.array([1, 2, np.nan, "A"], dtype=object)

inds = [i for i,n in enumerate(obj_arr) if str(n) == "nan"]

Or if you want a boolean mask:

mask = [str(n) == "nan" for n in obj_arr]

Using is np.nan also seems to work without needing to cast to str:

In [29]: obj_arr = np.array([1, 2, np.nan, "A"], dtype=object)

In [30]: [x is np.nan for x in obj_arr]
Out[30]: [False, False, True, False]

For flat and multidimensional arrays you could check the shape:

def masks(a):
    if len(a.shape) > 1:
        return [[x is np.nan for x in sub] for sub in a]
    return [x is np.nan for x in a]

If is np.nan can fail, you could check the type first and then use np.isnan:

def masks(a):
    # Only floats can be NaN; the isinstance check keeps np.isnan away from strings.
    if len(a.shape) > 1:
        return [[isinstance(x, float) and np.isnan(x) for x in sub] for sub in a]
    return [isinstance(x, float) and np.isnan(x) for x in a]
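The type-then-isnan idea can also be written so that it does not need a separate branch per number of dimensions. This is only a sketch, assuming Python 3; the names _is_nan_scalar and object_isnan are hypothetical helpers, not NumPy functions. np.frompyfunc still calls the Python function once per element, so it is a convenience rather than a true vectorised speed-up.

```python
import math
import numpy as np

def _is_nan_scalar(x):
    # Only real floating-point values can be NaN; strings such as "nan" are not.
    return isinstance(x, (float, np.floating)) and math.isnan(x)

# Wrap the scalar test so NumPy applies it element by element, for any shape.
_is_nan_elementwise = np.frompyfunc(_is_nan_scalar, 1, 1)

def object_isnan(arr):
    """Return a boolean mask with the same shape as arr (hypothetical helper)."""
    arr = np.asarray(arr, dtype=object)
    # frompyfunc yields Python bools in an object array; convert to a bool array.
    return np.asarray(_is_nan_elementwise(arr), dtype=bool)

flat = object_isnan(np.array([1, np.nan, "A", "nan"], dtype=object))
grid = object_isnan(np.array([[np.nan, "x"], [2.0, float("nan")]], dtype=object))

print(flat.tolist())  # [False, True, False, False]
print(grid.tolist())  # [[True, False], [False, True]]
```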

Interestingly x is np.nan seems to work fine when the data type is object:

In [76]: arr = np.array([np.nan,np.nan,"3"],dtype=object)

In [77]: [x is np.nan  for x in arr]
Out[77]: [True, True, False]

In [78]: arr = np.array([np.nan,np.nan,"3"])

In [79]: [x is np.nan  for x in arr]
Out[79]: [False, False, False]

Depending on the dtype, different things happen:

In [90]: arr = np.array([np.nan,np.nan,"3"])

In [91]: arr.dtype
Out[91]: dtype('S32')

In [92]: arr
Out[92]: 
array(['nan', 'nan', '3'], 
      dtype='|S32')

In [93]: [x == "nan"  for x in arr]
Out[93]: [True, True, False]

In [94]: arr = np.array([np.nan,np.nan,"3"],dtype=object)

In [95]: arr.dtype
Out[95]: dtype('O')

In [96]: arr
Out[96]: array([nan, nan, '3'], dtype=object)

In [97]: [x == "nan"  for x in arr]
Out[97]: [False, False, False]

The NaNs get coerced to numpy.string_ values when the array also contains strings and no dtype is given, so x == "nan" works in that case. When you pass dtype=object, the NaNs stay floats, so if you always use object dtype the behaviour should be consistent.
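The coercion is easy to check for yourself. The sketch below assumes Python 3, where the inferred string dtype is unicode (kind 'U') rather than the 'S32' bytes dtype shown in the Python 2 session above.

```python
import numpy as np

# No dtype given: NumPy picks a string dtype, and the NaNs become the text 'nan'.
mixed = np.array([np.nan, np.nan, "3"])

# dtype=object: every element keeps its original Python type.
as_obj = np.array([np.nan, np.nan, "3"], dtype=object)

mixed_values = mixed.tolist()
obj_types = [type(x).__name__ for x in as_obj]

print(mixed.dtype.kind)  # 'U' on Python 3
print(mixed_values)      # ['nan', 'nan', '3']
print(obj_types)         # ['float', 'float', 'str']
```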

answered Nov 13 '22 10:11 by Padraic Cunningham