I'm working with numpy arrays of different data types. I would like to know, for any particular array, which elements are NaN. Normally, this is what np.isnan is for. However, np.isnan isn't friendly to arrays of data type object (or to any string data type):
>>> str_arr = np.array(["A", "B", "C"])
>>> np.isnan(str_arr)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: Not implemented for this type
>>> obj_arr = np.array([1, 2, "A"], dtype=object)
>>> np.isnan(obj_arr)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
What I would like to get out of these two calls is simply np.array([False, False, False]). I can't just put try and except TypeError around my call to np.isnan and assume that any array that generates a TypeError does not contain NaNs: after all, I'd like np.isnan(np.array([1, np.NaN, "A"])) to return np.array([False, True, False]).
My current solution is to make a new array of type np.float64, loop through the elements of the original array, trying to put each element in the new array (and, if that fails, leaving it as zero), and then to call np.isnan on the new array. However, this is of course rather slow. (At least, for large object arrays.)
def isnan(arr):
    if isinstance(arr, np.ndarray) and (arr.dtype == object):
        # Create a new array of dtype float64, fill it with the same values
        # as the input array (where possible), and then call np.isnan on the
        # new array. This way, np.isnan is only called once. (Much faster
        # than calling it on every element in the input array.)
        new_arr = np.zeros((len(arr),), dtype=np.float64)
        for idx in xrange(len(arr)):
            try:
                new_arr[idx] = arr[idx]
            except Exception:
                pass
        return np.isnan(new_arr)
    else:
        try:
            return np.isnan(arr)
        except TypeError:
            return False
This particular implementation also only works for one-dimensional arrays, and I can't think of a decent way to make the for loop run over an arbitrary number of dimensions.
Is there a more efficient way to figure out which elements in an object-type array are NaN?
EDIT: I'm running Python 2.7.10.
Note that [x is np.nan for x in np.array([np.nan])] returns [False]: np.nan is not always the same object in memory as a different NaN.
I do not want the string "nan" to be considered equivalent to np.nan: I want isnan(np.array(["nan"], dtype=object)) to return np.array([False]).
The multi-dimensionality isn't a big issue. (It's nothing that a little ravel-and-reshape-ing won't fix. :p)
Any function that relies on the is operator to test equivalence of two NaNs isn't always going to work. (If you think it should, ask yourself what the is operator actually does!)
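As an aside not raised in the question: NaN is the only common value that compares unequal to itself, so an elementwise self-comparison gives a vectorized NaN test on object arrays without any casting. A sketch, assuming no element has a pathological __ne__:

```python
import numpy as np

obj_arr = np.array([1, np.nan, "A", "nan", None], dtype=object)

# For object arrays, != is applied elementwise. Only NaN is unequal to
# itself; ints, strings (including "nan"), and None all compare equal.
mask = obj_arr != obj_arr
print(mask)
```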
If you are willing to use the pandas library, a handy function that covers this case is pd.isnull:
pandas.isnull(obj)
Detect missing values (NaN in numeric arrays, None/NaN in object arrays)
Here is an example:
$ python
>>> import numpy
>>> import pandas
>>> array = numpy.asarray(['a', float('nan')], dtype=object)
>>> pandas.isnull(array)
array([False, True])
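One caveat and one convenience worth noting: pd.isnull also reports None as missing (which may be broader than the question asks for), and it handles multi-dimensional object arrays directly. A sketch, assuming a reasonably recent pandas:

```python
import numpy as np
import pandas as pd

arr2d = np.array([[1, np.nan], ["A", None]], dtype=object)

# pandas treats both NaN and None as missing; the string "nan" would
# still come out False, as the question requires.
mask = pd.isnull(arr2d)
print(mask)
```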
You could just use a list comprehension to get the indexes of any NaNs, which may be faster in this case:
obj_arr = np.array([1, 2, np.nan, "A"], dtype=object)
inds = [i for i,n in enumerate(obj_arr) if str(n) == "nan"]
Or if you want a boolean mask:
mask = [str(n) == "nan" for n in obj_arr]
Using is np.nan also seems to work without needing to cast to str:
In [29]: obj_arr = np.array([1, 2, np.nan, "A"], dtype=object)
In [30]: [x is np.nan for x in obj_arr]
Out[30]: [False, False, True, False]
For flat and multidimensional arrays you could check the shape:
def masks(a):
    if len(a.shape) > 1:
        return [[x is np.nan for x in sub] for sub in a]
    return [x is np.nan for x in a]
If is np.nan can fail, maybe check the type and then use np.isnan:
def masks(a):
    if len(a.shape) > 1:
        return [[isinstance(x, float) and np.isnan(x) for x in sub] for sub in a]
    return [isinstance(x, float) and np.isnan(x) for x in a]
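The same type-checking predicate can be lifted over arrays of any shape with np.frompyfunc, avoiding per-dimension list comprehensions. A sketch, not from the original answer (isnan_obj is a made-up name):

```python
import numpy as np

# np.frompyfunc turns a scalar predicate into an elementwise ufunc that
# handles object arrays of any shape; its result has dtype object, so
# cast it back to bool at the end.
_isnan_scalar = np.frompyfunc(
    lambda x: isinstance(x, float) and np.isnan(x), 1, 1)

def isnan_obj(a):
    return _isnan_scalar(a).astype(bool)

arr = np.array([[1, np.nan], ["A", "nan"]], dtype=object)
print(isnan_obj(arr))
```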
Interestingly, x is np.nan seems to work fine when the data type is object:
In [76]: arr = np.array([np.nan,np.nan,"3"],dtype=object)
In [77]: [x is np.nan for x in arr]
Out[77]: [True, True, False]
In [78]: arr = np.array([np.nan,np.nan,"3"])
In [79]: [x is np.nan for x in arr]
Out[79]: [False, False, False]
Depending on the dtype, different things happen:
In [90]: arr = np.array([np.nan,np.nan,"3"])
In [91]: arr.dtype
Out[91]: dtype('S32')
In [92]: arr
Out[92]:
array(['nan', 'nan', '3'],
dtype='|S32')
In [93]: [x == "nan" for x in arr]
Out[93]: [True, True, False]
In [94]: arr = np.array([np.nan,np.nan,"3"],dtype=object)
In [95]: arr.dtype
Out[95]: dtype('O')
In [96]: arr
Out[96]: array([nan, nan, '3'], dtype=object)
In [97]: [x == "nan" for x in arr]
Out[97]: [False, False, False]
Obviously the NaNs get coerced to numpy.string_ when you have strings in your array, so x == "nan" works in that case; when you pass object, the type is float, so if you are always using the object dtype then the behaviour should be consistent.