
Truth value of numpy array with one falsey element seems to depend on dtype


import numpy as np
a = np.array([0])
b = np.array([None])
c = np.array([''])
d = np.array([' '])

Why do we have this inconsistency?

>>> bool(a)
False
>>> bool(b)
False
>>> bool(c)
True
>>> bool(d)
False
wim asked May 05 '15 03:05



2 Answers

For arrays with one element, the array's truth value is determined by the truth value of that element.

The main point is that np.array(['']) is not an array containing one empty Python string. The array is created to hold strings of exactly one byte each, and NumPy pads strings that are too short with the null character. This means that the array is equal to np.array(['\0']).
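A quick way to see the padding is to look at the raw storage. This is only a sketch: the names c and cb are mine, and the one-byte case is shown with a bytes literal because plain string literals are text (unicode) on Python 3.

```python
import numpy as np

c = np.array([''])    # text string: NumPy still reserves one character
cb = np.array([b''])  # byte string: one byte per element, as described above

# Each element has a fixed, non-zero size even though the input was empty.
print(c.dtype.kind, c.dtype.itemsize)
print(cb.dtype.kind, cb.dtype.itemsize)

# The reserved storage is filled with null bytes.
print(c.tobytes())
print(cb.tobytes())
```

So the "empty" string occupies real storage, and that storage holds nulls rather than nothing.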

In this regard, NumPy is being consistent with Python which evaluates bool('\0') as True.

In fact, the only strings that evaluate to False in NumPy arrays are those made up entirely of whitespace characters ('\0' is not a whitespace character).

Details of this Boolean evaluation are presented below.


Navigating NumPy's labyrinthine source code is not always easy, but we can find the code governing how values in different datatypes are mapped to Boolean values in the arraytypes.c.src file. This will explain how bool(a), bool(b), bool(c) and bool(d) are determined.

Before we get to the code in that file, we can see that calling bool() on a NumPy array invokes the internal _array_nonzero() function. If the array is empty, we get False. If there are two or more elements we get an error. But if the array has exactly one element, we hit the line:

return PyArray_DESCR(mp)->f->nonzero(PyArray_DATA(mp), mp);

Now, PyArray_DESCR is a struct holding various properties for the array. f is a pointer to another struct PyArray_ArrFuncs that holds the array's nonzero function. In other words, NumPy is going to call upon the array's own special nonzero function to check the Boolean value of that one element.
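The dispatch described above can be modelled in a few lines of plain Python. This is a rough sketch of the logic as described in this answer, not NumPy's actual code: array_truth and int_nonzero are hypothetical names, and the handling of the empty case may differ between NumPy releases.

```python
def array_truth(elements, element_nonzero):
    """Rough model of the size-based dispatch (not NumPy's code)."""
    n = len(elements)
    if n == 0:
        return False  # empty array, as described in this answer
    if n == 1:
        # Delegate to the dtype-specific check for the single element.
        return element_nonzero(elements[0])
    raise ValueError(
        "The truth value of an array with more than one element is ambiguous"
    )


def int_nonzero(x):
    """Stand-in for an integer dtype's nonzero function."""
    return x != 0


print(array_truth([], int_nonzero))
print(array_truth([0], int_nonzero))
print(array_truth([7], int_nonzero))
```

The interesting part is the second argument: the result for a one-element array depends entirely on which element_nonzero function the dtype supplies.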

Determining whether an element is nonzero or not is obviously going to depend on the datatype of the element. The code implementing the type-specific nonzero functions can be found in the "nonzero" section of the arraytypes.c.src file.

As we'd expect, floats, integers and complex numbers are False if they're equal to zero. This explains bool(a). In the case of object arrays, None is similarly going to be evaluated as False because NumPy just calls the PyObject_IsTrue function. This explains bool(b).
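Both rules are easy to check from Python. In the sketch below, Flag is a hypothetical helper class (not part of NumPy) whose truth value we control, which shows that an object array simply defers to the element's own truthiness.

```python
import numpy as np


class Flag:
    """Hypothetical helper: an object with a truth value we choose."""

    def __init__(self, value):
        self.value = value

    def __bool__(self):
        return self.value


# Numeric dtypes: False exactly when the single element equals zero.
print(bool(np.array([0])), bool(np.array([0.0])), bool(np.array([0j])))
print(bool(np.array([3])), bool(np.array([1e-300])))

# Object dtype: the element's own truth value decides.
print(bool(np.array([None])))
print(bool(np.array([Flag(False)], dtype=object)))
print(bool(np.array([Flag(True)], dtype=object)))
```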

To understand the results of bool(c) and bool(d), we see that the nonzero function for string type arrays is mapped to the STRING_nonzero function:

static npy_bool
STRING_nonzero (char *ip, PyArrayObject *ap)
{
    int len = PyArray_DESCR(ap)->elsize; // size of dtype (not string length)
    int i;
    npy_bool nonz = NPY_FALSE;

    for (i = 0; i < len; i++) {
        if (!Py_STRING_ISSPACE(*ip)) {   // if it isn't whitespace, it's True
            nonz = NPY_TRUE;
            break;
        }
        ip++;
    }
    return nonz;
}

(The unicode case is more or less the same idea.)
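For readers who don't read C, here is a pure-Python model of the loop above. It is a sketch, not NumPy's implementation: string_nonzero and WHITESPACE are names I've made up, and the whitespace set is my assumption about what Py_STRING_ISSPACE accepts (the usual C isspace characters).

```python
# Assumed whitespace set: space, tab, newline, carriage return,
# vertical tab and form feed.
WHITESPACE = b" \t\n\r\x0b\x0c"


def string_nonzero(buf: bytes) -> bool:
    """Model of STRING_nonzero over one fixed-size element buffer."""
    for byte in buf:
        if byte not in WHITESPACE:
            return True  # any non-whitespace byte makes the element truthy
    return False


print(string_nonzero(b" "))     # whitespace only
print(string_nonzero(b"\x00"))  # the null padding of an "empty" string
print(string_nonzero(b"a"))
```

Note that the loop runs over the whole element buffer, padding included, which is why the null byte gets a say in the result.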

So in arrays with a string or unicode datatype, a string is only False if it contains only whitespace characters:

>>> bool(np.array([' ']))
False

In the case of array c in the question, there is really a null character \0 padding the seemingly-empty string:

>>> np.array(['']) == np.array(['\0'])
array([ True], dtype=bool)

The STRING_nonzero function sees this non-whitespace character and so bool(c) is True.

As noted at the start of this answer, this is consistent with Python's evaluation of strings containing a single null character: bool('\0') is also True.


Update: Wim has fixed the behaviour detailed above in NumPy's master branch by making strings which contain only null characters, or a mix of only whitespace and null characters, evaluate to False. This means that NumPy 1.10+ will see that bool(np.array([''])) is False, which is much more in line with Python's treatment of "empty" strings.
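The updated rule can be modelled the same way, by also ignoring null bytes. Again this is only a sketch of the rule as described in the update, with made-up names (string_nonzero_fixed, IGNORED) and an assumed whitespace set; string truthiness may have been revisited in later releases, so it is worth checking on your installed version.

```python
# Whitespace characters (assumed set) plus the null padding byte.
IGNORED = b" \t\n\r\x0b\x0c\x00"


def string_nonzero_fixed(buf: bytes) -> bool:
    """Model of the updated rule: whitespace and nulls do not count."""
    return any(byte not in IGNORED for byte in buf)


print(string_nonzero_fixed(b"\x00"))   # padded "empty" string
print(string_nonzero_fixed(b" \x00"))  # whitespace plus padding
print(string_nonzero_fixed(b"a\x00"))  # real content plus padding
```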

Alex Riley answered Oct 05 '22 22:10



I'm pretty sure the answer is, as explained in Scalars, that:

Array scalars have the same attributes and methods as ndarrays. [1] This allows one to treat items of an array partly on the same footing as arrays, smoothing out rough edges that result when mixing scalar and array operations.

So, if it's acceptable to call bool on a scalar, it must be acceptable to call bool on an array of shape (1,), because they are, as far as possible, the same thing.

And, while it isn't directly said anywhere in the docs that I know of, it's pretty obvious from the design that NumPy's scalars are supposed to act like native Python objects.

So, that explains why np.array([0]) is falsey rather than truthy, which is what you were initially surprised about.
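For numeric values, the three views (plain Python value, NumPy scalar, shape-(1,) array) can be compared directly. A small sketch, with values and results as names of my own choosing:

```python
import numpy as np

values = [0, 7, 0.0, 2.5]

# For each value: truthiness of the Python object, of the NumPy scalar,
# and of a one-element array holding it.
results = [
    (bool(v), bool(np.float64(v)), bool(np.array([v])))
    for v in values
]

for v, row in zip(values, results):
    print(v, row)
```

All three columns agree for these numeric inputs; the string case discussed next is where the agreement breaks down.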


So, that explains the basics. But what about the specifics of case c?

First, note that your array np.array(['']) is not an array of one Python object, but an array of one NumPy <U1 null-terminated character string of length 1. Fixed-length-string values don't have the same truthiness rule as Python strings, and they really couldn't: for a fixed-length-string type, "false if empty" doesn't make any sense, because they're never empty. You could argue about whether NumPy should have been designed that way or not, but it clearly does follow that rule consistently, and I don't think the opposite rule would be any less confusing here, just different.

But there seems to be something else weird going on with strings. Consider this:

>>> np.array(['a', 'b']) != 0
True

That's not doing an elementwise comparison of the <U1 strings to 0 and returning array([True, True]) (as you'd get from np.array(['a', 'b'], dtype=object)); it's doing an array-wide comparison and deciding that no array of strings is equal to 0, which seems odd… I'm not sure whether this deserves a separate answer here or even a whole separate question, but I am pretty sure I'm not going to be the one who writes that answer, because I have no clue what's going on here. :)
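The object-dtype half of that comparison is easy to confirm. I've left the fixed-width string case out of this sketch on purpose, since its result seems to depend on the NumPy version; obj and result are names of my own.

```python
import numpy as np

obj = np.array(['a', 'b'], dtype=object)

# With object dtype, each Python string is compared to 0 individually.
result = obj != 0
print(result)
```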


Beyond arrays of shape (1,), arrays of shape () are treated the same way, but anything else is a ValueError, because otherwise it would be very easy to misuse arrays with the and operator and other Python operators that NumPy can't automagically convert into elementwise operations.
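A short sketch of those cases, along with the explicit reductions you'd normally use instead of bool() on a multi-element array (pair is just an illustrative name):

```python
import numpy as np

# Shape () arrays behave like the single value they hold.
print(bool(np.array(0)), bool(np.array(5)))

pair = np.array([0, 1])
try:
    bool(pair)  # more than one element: ambiguous
except ValueError as exc:
    print("ValueError:", exc)

# Say what you mean explicitly instead.
print(pair.any(), pair.all())
```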

I personally think being consistent with other arrays would be more useful than being consistent with scalars here—in other words, just raise a ValueError. I also think that, if being consistent with scalars were important here, it would be better to be consistent with the unboxed Python values. In other words, if bool(array([v])) and bool(array(v)) are going to be allowed at all, they should always return exactly the same thing as bool(v), even if that's not consistent with np.nonzero. But I can see the argument the other way.

abarnert answered Oct 06 '22 00:10
