import numpy
data = numpy.random.randint(0, 10, (6,8))
test = set(numpy.random.randint(0, 10, 5))
I want an expression whose value is a Boolean array with the same shape as data (or that can at least be reshaped to that shape), telling me whether the corresponding element of data is in test.
E.g., if I want to know which elements of data are strictly less than 6, I can use a single vectorized expression,
a = data < 6
that computes a 6x8 boolean ndarray. By contrast, when I try an apparently equivalent boolean expression
b = data in test
what I get is an exception:
TypeError: unhashable type: 'numpy.ndarray'
Edit: the possibility #4 below gives wrong results, thanks to hpaulj and Divakar for getting me on the right track.
Here I compare four different possibilities:
1. np.in1d(data, np.hstack(test))
2. np.in1d(data, np.array(list(test)))
3. np.in1d(data, np.fromiter(test, int))
4. np.in1d(data, test)
Here is the IPython session, slightly edited to avoid blank lines:
In [1]: import numpy as np
In [2]: nr, nc = 100, 100
In [3]: top = 3000
In [4]: data = np.random.randint(0, top, (nr, nc))
In [5]: test = set(np.random.randint(0, top, top//3))
In [6]: %timeit np.in1d(data, np.hstack(test))
100 loops, best of 3: 5.65 ms per loop
In [7]: %timeit np.in1d(data, np.array(list(test)))
1000 loops, best of 3: 1.4 ms per loop
In [8]: %timeit np.in1d(data, np.fromiter(test, int))
1000 loops, best of 3: 1.33 ms per loop
In [9]: %timeit np.in1d(data, test)
1000 loops, best of 3: 687 µs per loop
In [10]: nr, nc = 1000, 1000
In [11]: top = 300000
In [12]: data = np.random.randint(0, top, (nr, nc))
In [13]: test = set(np.random.randint(0, top, top//3))
In [14]: %timeit np.in1d(data, np.hstack(test))
1 loop, best of 3: 706 ms per loop
In [15]: %timeit np.in1d(data, np.array(list(test)))
1 loop, best of 3: 269 ms per loop
In [16]: %timeit np.in1d(data, np.fromiter(test, int))
1 loop, best of 3: 274 ms per loop
In [17]: %timeit np.in1d(data, test)
10 loops, best of 3: 67.9 ms per loop
In [18]:
The best times are given by the (now) anonymous poster's answer.
It turns out that the anonymous poster had a good reason to remove their answer: the results are wrong!
As commented by hpaulj, the documentation of in1d warns against using a set as the second argument, but I would prefer an explicit failure if the computed results could be wrong.
That said, the solution using numpy.fromiter() has the best numbers among the correct solutions in the first test, and is on par with np.array(list(test)) in the second.
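To make the pitfall concrete, here is a minimal sketch of why passing the set itself produces a wrong mask. It uses np.isin (the shape-preserving relative of np.in1d, which I am assuming is available, i.e. NumPy 1.13 or later) and a small hand-made array, so the names and values below are illustrative and not taken from the timings above.

```python
import numpy as np

data = np.array([[1, 2, 3], [4, 5, 6]])
test = {2, 3, 5}

# Passing the set itself: NumPy does not treat a set as a sequence, so it is
# wrapped as a single object and every comparison is "number == whole set".
with_set = np.isin(data, test)

# Converting the set to an integer array first gives the intended mask.
with_array = np.isin(data, np.fromiter(test, dtype=int, count=len(test)))

# Pure-Python reference, used only to check the result.
expected = np.array([[x in test for x in row] for row in data.tolist()])

print(with_set.any())                        # no element is reported as present
print(np.array_equal(with_array, expected))  # the converted version agrees
```

No exception is raised in the first case, which is exactly why the fast timing of possibility #4 is misleading.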
I am assuming you are looking for a boolean array that detects the presence of the elements of test in the data array. To do so, you can extract the elements from test with np.hstack and then use np.in1d to detect the presence of any element of test at each position in data, giving us a boolean array of the same size as data. Since np.in1d flattens the input before processing, as a final step we need to reshape the output of np.in1d back to the original 2D shape. Thus, the final implementation would be -
np.in1d(data,np.hstack(test)).reshape(data.shape)
Sample run -
In [125]: data
Out[125]:
array([[7, 0, 1, 8, 9, 5, 9, 1],
       [9, 7, 1, 4, 4, 2, 4, 4],
       [0, 4, 9, 6, 6, 3, 5, 9],
       [2, 2, 7, 7, 6, 7, 7, 2],
       [3, 4, 8, 4, 2, 1, 9, 8],
       [9, 0, 8, 1, 6, 1, 3, 5]])
In [126]: test
Out[126]: {3, 4, 6, 7, 9}
In [127]: np.in1d(data,np.hstack(test)).reshape(data.shape)
Out[127]:
array([[ True, False, False, False,  True, False,  True, False],
       [ True,  True, False,  True,  True, False,  True,  True],
       [False,  True,  True,  True,  True,  True, False,  True],
       [False, False,  True,  True,  True,  True,  True, False],
       [ True,  True, False,  True, False, False,  True, False],
       [ True, False, False, False,  True, False,  True, False]], dtype=bool)
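For readers on a newer NumPy, the same mask can probably be obtained without the reshape step. The sketch below assumes np.isin is available (NumPy 1.13 or later) and converts the set with list() instead of np.hstack, since, as far as I know, recent NumPy releases expect a list or tuple for the stacking functions. The data and test values are copied from the sample run above.

```python
import numpy as np

data = np.array([[7, 0, 1, 8, 9, 5, 9, 1],
                 [9, 7, 1, 4, 4, 2, 4, 4],
                 [0, 4, 9, 6, 6, 3, 5, 9],
                 [2, 2, 7, 7, 6, 7, 7, 2],
                 [3, 4, 8, 4, 2, 1, 9, 8],
                 [9, 0, 8, 1, 6, 1, 3, 5]])
test = {3, 4, 6, 7, 9}

# np.isin keeps the shape of its first argument, so no reshape is needed.
# The set is turned into a list first because NumPy does not treat a set
# as a sequence of values.
mask = np.isin(data, list(test))

print(mask.shape)  # same shape as data
print(mask[0])     # first row of the boolean mask
```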
The expression a = data < 6 returns a new array because < is a value comparison operator.
Arithmetic, matrix multiplication, and comparison operations
Arithmetic and comparison operations on ndarrays are defined as element-wise operations, and generally yield ndarray objects as results.
Each of the arithmetic operations (+, -, *, /, //, %, divmod(), ** or pow(), <<, >>, &, ^, |, ~) and the comparisons (==, <, >, <=, >=, !=) is equivalent to the corresponding universal function (or ufunc for short) in Numpy.
Note that the in operator is not in this list, probably because it works in the opposite direction to most operators. While a + b is the same as a.__add__(b), a in b works right to left, as b.__contains__(a). In this case Python tries to call set.__contains__(), which only accepts hashable types. Arrays are mutable and therefore unhashable, so they can't be members of a set.
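The dispatch described above can be checked with a few lines. This is only an illustrative sketch with made-up values: it shows that the whole array is handed to the set, which fails when the array has to be hashed, while a single scalar element can be tested without trouble.

```python
import numpy as np

data = np.array([[1, 2], [3, 4]])
test = {2, 3}

# `data in test` is evaluated as test.__contains__(data), which has to hash
# the whole array; ndarrays are unhashable, so this raises TypeError.
try:
    data in test
    raised = False
    message = ""
except TypeError as exc:
    raised = True
    message = str(exc)

# A single element is hashable, so membership works one value at a time.
scalar_hit = int(data[0, 1]) in test

print(raised, message)
print(scalar_hit)
```

So the membership test itself is fine; what is missing is a way to apply it to each element of the array.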
A solution is to use numpy.vectorize instead of using in directly; it lets you call any Python function on each element of the array. It's a kind of map() for NumPy arrays.
numpy.vectorize
Define a vectorized function which takes a nested sequence of objects or numpy arrays as inputs and returns a numpy array as output. The vectorized function evaluates pyfunc over successive tuples of the input arrays like the python map function, except it uses the broadcasting rules of numpy.
>>> import numpy
>>> data = numpy.random.randint(0, 10, (3, 3))
>>> test = set(numpy.random.randint(0, 10, 5))
>>> numpy.vectorize(test.__contains__)(data)
array([[False, False,  True],
       [ True,  True, False],
       [ True, False,  True]], dtype=bool)
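A slightly more defensive variant, as a sketch: passing otypes=[bool] fixes the output dtype up front instead of letting np.vectorize infer it from the first call, which should also let the wrapped function accept empty input. The fixed data and the name in_test are my own choices so that the result is reproducible. Keep in mind that the NumPy documentation describes vectorize as a convenience that is essentially a Python-level loop.

```python
import numpy as np

data = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
test = {2, 3, 4, 6, 8}

# otypes=[bool] declares the result dtype, so no trial call is needed
# to guess it and size-0 inputs are accepted.
in_test = np.vectorize(test.__contains__, otypes=[bool])

mask = in_test(data)                           # same shape as data
empty_mask = in_test(np.array([], dtype=int))  # empty in, empty out

print(mask)
print(empty_mask.shape)
```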
This approach is fast when n is large, since set.__contains__() is a constant-time operation ("large" here means that top > 13000 or so).
>>> import numpy as np
>>> nr, nc = 100, 100
>>> top = 300000
>>> data = np.random.randint(0, top, (nr, nc))
>>> test = set(np.random.randint(0, top, top//3))
>>> %timeit -n10 np.in1d(data, list(test)).reshape(data.shape)
10 loops, best of 3: 26.2 ms per loop
>>> %timeit -n10 np.in1d(data, np.hstack(test)).reshape(data.shape)
10 loops, best of 3: 374 ms per loop
>>> %timeit -n10 np.vectorize(test.__contains__)(data)
10 loops, best of 3: 3.16 ms per loop
However, when n is small, the other solutions are significantly faster.
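Since the crossover point depends on the machine and the NumPy version, it may be worth re-measuring locally. The helper below is a hypothetical sketch, not part of the original measurements: compare is a name I made up, it uses the standard timeit module instead of %timeit, and np.isin stands in for the np.in1d(...).reshape(...) form.

```python
import timeit
import numpy as np

def compare(top, shape=(100, 100), repeats=5):
    """Return the best time in seconds per call for each approach."""
    rng = np.random.default_rng(0)
    data = rng.integers(0, top, shape)
    test = set(rng.integers(0, top, top // 3).tolist())
    candidates = {
        "isin_list": lambda: np.isin(data, list(test)),
        "vectorize": lambda: np.vectorize(test.__contains__, otypes=[bool])(data),
    }
    # Both candidates must agree before their timings mean anything.
    assert np.array_equal(candidates["isin_list"](), candidates["vectorize"]())
    return {name: min(timeit.repeat(fn, number=1, repeat=repeats))
            for name, fn in candidates.items()}

timings = compare(top=3000)
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e3:.3f} ms")
```

Calling compare with a few values of top (for example 3000 and 300000, as in the sessions above) should show where the two approaches cross over on a given setup.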