I'm working with numpy arrays of different data types. I would like to know, for any particular array, which elements are NaN. Normally, this is what np.isnan is for. However, np.isnan isn't friendly to arrays of data type object (or to any string data type):
>>> str_arr = np.array(["A", "B", "C"])
>>> np.isnan(str_arr)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: Not implemented for this type
>>> obj_arr = np.array([1, 2, "A"], dtype=object)
>>> np.isnan(obj_arr)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
What I would like to get out of these two calls is simply np.array([False, False, False]). I can't just put try and except TypeError around my call to np.isnan and assume that any array that generates a TypeError does not contain NaNs: after all, I'd like np.isnan(np.array([1, np.NaN, "A"])) to return np.array([False, True, False]).
My current solution is to make a new array of type np.float64, loop through the elements of the original array, trying to put each element in the new array (and, if that fails, leaving it as zero), and then to call np.isnan on the new array. However, this is of course rather slow. (At least, for large object arrays.)
def isnan(arr):
    if isinstance(arr, np.ndarray) and (arr.dtype == object):
        # Create a new array of dtype float64, fill it with the same values
        # as the input array (where possible), and then call np.isnan on the
        # new array. This way, np.isnan is only called once. (Much faster
        # than calling it on every element in the input array.)
        new_arr = np.zeros((len(arr),), dtype=np.float64)
        for idx in xrange(len(arr)):
            try:
                new_arr[idx] = arr[idx]
            except Exception:
                pass
        return np.isnan(new_arr)
    else:
        try:
            return np.isnan(arr)
        except TypeError:
            return False
This particular implementation also only works for one-dimensional arrays, and I can't think of a decent way to make the for loop run over an arbitrary number of dimensions.
Is there a more efficient way to figure out which elements in an object-type array are NaN?
EDIT: I'm running Python 2.7.10.
Note that [x is np.nan for x in np.array([np.nan])] returns [False]: np.nan is not always the same object in memory as a different NaN.
I do not want the string "nan" to be considered equivalent to np.nan: I want isnan(np.array(["nan"], dtype=object)) to return np.array([False]).
The multi-dimensionality isn't a big issue. (It's nothing that a little ravel-and-reshape-ing won't fix. :p)
Any function that relies on the is operator to test equivalence of two NaNs isn't always going to work. (If you think it should, ask yourself what the is operator actually does!)
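As an aside not raised in the question: NaN is the only common value that compares unequal to itself, so an elementwise self-comparison gives a vectorized NaN test on object arrays without any casting. A sketch, assuming no element has a pathological __ne__:

```python
import numpy as np

obj_arr = np.array([1, np.nan, "A", "nan", None], dtype=object)

# For object arrays, != is applied elementwise. Only NaN is unequal to
# itself; ints, strings (including "nan"), and None all compare equal.
mask = obj_arr != obj_arr
print(mask)
```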
If you are willing to use the pandas library, a handy function that covers this case is pd.isnull:
pandas.isnull(obj)
Detect missing values (NaN in numeric arrays, None/NaN in object arrays)
Here is an example:
$ python
>>> import numpy
>>> import pandas
>>> array = numpy.asarray(['a', float('nan')], dtype=object)
>>> pandas.isnull(array)
array([False, True])
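One caveat and one convenience worth noting: pd.isnull also reports None as missing (which may be broader than the question asks for), and it handles multi-dimensional object arrays directly. A sketch, assuming a reasonably recent pandas:

```python
import numpy as np
import pandas as pd

arr2d = np.array([[1, np.nan], ["A", None]], dtype=object)

# pandas treats both NaN and None as missing; the string "nan" would
# still come out False, as the question requires.
mask = pd.isnull(arr2d)
print(mask)
```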
You could just use a list comprehension to get the indexes of any NaNs, which may be faster in this case:
obj_arr = np.array([1, 2, np.nan, "A"], dtype=object)
inds = [i for i,n in enumerate(obj_arr) if str(n) == "nan"]
Or if you want a boolean mask:
mask = [str(n) == "nan" for n in obj_arr]
Using is np.nan also seems to work without needing to cast to str:
In [29]: obj_arr = np.array([1, 2, np.nan, "A"], dtype=object)
In [30]: [x is np.nan for x in obj_arr]
Out[30]: [False, False, True, False]
For flat and multidimensional arrays you could check the shape:
def masks(a):
    if len(a.shape) > 1:
        return [[x is np.nan for x in sub] for sub in a]
    return [x is np.nan for x in a]
If is np.nan can fail, maybe check the type and then use np.isnan:
def masks(a):
    if len(a.shape) > 1:
        return [[isinstance(x, float) and np.isnan(x) for x in sub] for sub in a]
    return [isinstance(x, float) and np.isnan(x) for x in a]
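The same type-checking predicate can be lifted over arrays of any shape with np.frompyfunc, avoiding per-dimension list comprehensions. A sketch, not from the original answer (isnan_obj is a made-up name):

```python
import numpy as np

# np.frompyfunc turns a scalar predicate into an elementwise ufunc that
# handles object arrays of any shape; its result has dtype object, so
# cast it back to bool at the end.
_isnan_scalar = np.frompyfunc(
    lambda x: isinstance(x, float) and np.isnan(x), 1, 1)

def isnan_obj(a):
    return _isnan_scalar(a).astype(bool)

arr = np.array([[1, np.nan], ["A", "nan"]], dtype=object)
print(isnan_obj(arr))
```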
Interestingly, x is np.nan seems to work fine when the data type is object:
In [76]: arr = np.array([np.nan,np.nan,"3"],dtype=object)
In [77]: [x is np.nan for x in arr]
Out[77]: [True, True, False]
In [78]: arr = np.array([np.nan,np.nan,"3"])
In [79]: [x is np.nan for x in arr]
Out[79]: [False, False, False]
Depending on the dtype, different things happen:
In [90]: arr = np.array([np.nan,np.nan,"3"])
In [91]: arr.dtype
Out[91]: dtype('S32')
In [92]: arr
Out[92]:
array(['nan', 'nan', '3'],
dtype='|S32')
In [93]: [x == "nan" for x in arr]
Out[93]: [True, True, False]
In [94]: arr = np.array([np.nan,np.nan,"3"],dtype=object)
In [95]: arr.dtype
Out[95]: dtype('O')
In [96]: arr
Out[96]: array([nan, nan, '3'], dtype=object)
In [97]: [x == "nan" for x in arr]
Out[97]: [False, False, False]
Obviously the NaNs get coerced to numpy.string_ when you have strings in your array, so x == "nan" works in that case; when you pass object, the type is float, so if you are always using the object dtype then the behaviour should be consistent.