
Are elements of an array in a set?

Tags: python, numpy

import numpy
data = numpy.random.randint(0, 10, (6,8))
test = set(numpy.random.randint(0, 10, 5))

I want an expression whose value is a Boolean array, with the same shape as data (or, at least, one that can be reshaped to the same shape), that tells me whether the corresponding element of data is in test.

E.g., if I want to know which elements of data are strictly less than 6, I can use a single vectorized expression,

a = data < 6

that computes a 6x8 boolean ndarray. By contrast, when I try an apparently equivalent boolean expression

b = data in test

what I get is an exception:

TypeError: unhashable type: 'numpy.ndarray'

Addendum: benchmarking different solutions

Edit: the possibility #4 below gives wrong results, thanks to hpaulj and Divakar for getting me on the right track.

Here I compare four different possibilities,

  1. What was proposed by Divakar, np.in1d(data, np.hstack(test)).
  2. One proposal by hpaulj, np.in1d(data, np.array(list(test))).
  3. Another proposal by hpaulj, np.in1d(data, np.fromiter(test, int)).
  4. What was proposed in an answer removed by its author, whose name I don't remember, np.in1d(data, test).

Here is the IPython session, slightly edited to avoid blank lines

In [1]: import numpy as np
In [2]: nr, nc = 100, 100
In [3]: top = 3000
In [4]: data = np.random.randint(0, top, (nr, nc))
In [5]: test = set(np.random.randint(0, top, top//3))
In [6]: %timeit np.in1d(data, np.hstack(test))
100 loops, best of 3: 5.65 ms per loop
In [7]: %timeit np.in1d(data, np.array(list(test)))
1000 loops, best of 3: 1.4 ms per loop
In [8]: %timeit np.in1d(data, np.fromiter(test, int))
1000 loops, best of 3: 1.33 ms per loop

In [9]: %timeit np.in1d(data, test)
1000 loops, best of 3: 687 µs per loop

In [10]: nr, nc = 1000, 1000
In [11]: top = 300000
In [12]: data = np.random.randint(0, top, (nr, nc))
In [13]: test = set(np.random.randint(0, top, top//3))
In [14]: %timeit np.in1d(data, np.hstack(test))
1 loop, best of 3: 706 ms per loop
In [15]: %timeit np.in1d(data, np.array(list(test)))
1 loop, best of 3: 269 ms per loop
In [16]: %timeit np.in1d(data, np.fromiter(test, int))
1 loop, best of 3: 274 ms per loop

In [17]: %timeit np.in1d(data, test)
10 loops, best of 3: 67.9 ms per loop

In [18]: 

The best times are given by the (now) anonymous poster's answer.

It turns out that the anonymous poster had a good reason to remove their answer: the results are wrong!

As hpaulj commented, the documentation of in1d warns against using a set as the second argument, but I would prefer an explicit failure whenever the computed results could be wrong.

That said, among the solutions that give correct results, the ones using numpy.fromiter() and numpy.array(list(test)) have the best numbers, and the two are nearly tied...
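Since timing alone did not reveal that possibility #4 was wrong, a correctness check next to the benchmark can help. The following is a minimal sketch, not part of the original session: it compares each set-to-array conversion against a plain Python membership test. It uses np.isin (NumPy 1.13 or later) where the session above uses np.in1d, because np.isin keeps the shape of data; the seeded generator and the variable names are illustrative assumptions.

```python
import numpy as np

# Illustrative data, smaller than in the session above.
rng = np.random.default_rng(0)
data = rng.integers(0, 3000, size=(100, 100))
test = set(rng.integers(0, 3000, size=1000).tolist())

# Reference result computed with ordinary Python set membership.
reference = np.array([[x in test for x in row] for row in data.tolist()])

# Two ways of turning the set into a 1-D array before the lookup.
candidates = {
    "np.array(list(test))": np.array(list(test)),
    "np.fromiter(test, int)": np.fromiter(test, dtype=int, count=len(test)),
}

for name, values in candidates.items():
    result = np.isin(data, values)          # same shape as data
    assert result.shape == data.shape
    assert np.array_equal(result, reference), name
    print(name, "agrees with the reference")
```

A check like this would presumably have flagged the variant that passes the set directly, because its output does not match the reference.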

gboffi asked Jun 12 '16 13:06



2 Answers

I am assuming you want a boolean array that marks which elements of data are present in the set. To get it, you can extract the elements of the set with np.hstack and then use np.in1d to test each position in data for membership, which gives a boolean array with as many elements as data. Since np.in1d flattens its input before processing, the final step is to reshape its output back to the original 2D shape. Thus, the final implementation would be -

np.in1d(data,np.hstack(test)).reshape(data.shape)

Sample run -

In [125]: data
Out[125]: 
array([[7, 0, 1, 8, 9, 5, 9, 1],
       [9, 7, 1, 4, 4, 2, 4, 4],
       [0, 4, 9, 6, 6, 3, 5, 9],
       [2, 2, 7, 7, 6, 7, 7, 2],
       [3, 4, 8, 4, 2, 1, 9, 8],
       [9, 0, 8, 1, 6, 1, 3, 5]])

In [126]: test
Out[126]: {3, 4, 6, 7, 9}

In [127]: np.in1d(data,np.hstack(test)).reshape(data.shape)
Out[127]: 
array([[ True, False, False, False,  True, False,  True, False],
       [ True,  True, False,  True,  True, False,  True,  True],
       [False,  True,  True,  True,  True,  True, False,  True],
       [False, False,  True,  True,  True,  True,  True, False],
       [ True,  True, False,  True, False, False,  True, False],
       [ True, False, False, False,  True, False,  True, False]], dtype=bool)
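As a variation on the sample run above, np.isin performs the same membership test but returns a result with the shape of data, so the reshape step is not needed. This is only a sketch and assumes NumPy 1.13 or later; the set is converted with list(test) rather than np.hstack, which is a substitution and not what the answer itself uses.

```python
import numpy as np

# Same data and set as in the sample run.
data = np.array([[7, 0, 1, 8, 9, 5, 9, 1],
                 [9, 7, 1, 4, 4, 2, 4, 4],
                 [0, 4, 9, 6, 6, 3, 5, 9],
                 [2, 2, 7, 7, 6, 7, 7, 2],
                 [3, 4, 8, 4, 2, 1, 9, 8],
                 [9, 0, 8, 1, 6, 1, 3, 5]])
test = {3, 4, 6, 7, 9}

# np.isin tests every element of data against the given values
# and keeps the shape of data, so no reshape is required.
mask = np.isin(data, list(test))

print(mask.shape)   # (6, 8)
print(mask)
```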
Divakar answered Sep 22 '22 02:09


The expression a = data < 6 returns a new array because < is a value comparison operator.

Arithmetic, matrix multiplication, and comparison operations

Arithmetic and comparison operations on ndarrays are defined as element-wise operations, and generally yield ndarray objects as results.

Each of the arithmetic operations (+, -, *, /, //, %, divmod(), ** or pow(), <<, >>, &, ^, |, ~) and the comparisons (==, <, >, <=, >=, !=) is equivalent to the corresponding universal function (or ufunc for short) in Numpy.

Note that the in operator is not in this list, probably because it works in the opposite direction to most operators.

While a + b is the same as a.__add__(b), a in b is evaluated right to left, as b.__contains__(a). In this case Python tries to call set.__contains__(), which only accepts hashable types. Arrays are mutable and therefore unhashable, so they can't be looked up in a set.
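The dispatch described above can be observed directly. The following is a small sketch with made-up values: it shows that the lookup fails when the whole array is the left operand of in, while the same set accepts the individual elements.

```python
import numpy as np

data = np.arange(6).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]
test = {1, 4}

# "data in test" asks the set, not the array, so the set tries to hash
# the whole ndarray and fails.
try:
    data in test
except TypeError as exc:
    message = str(exc)
    print("TypeError:", message)

# A single element is hashable, so the lookup works.
print(4 in test)   # True

# Applying the lookup element by element gives the wanted boolean array.
mask = np.array([[x in test for x in row] for row in data.tolist()])
print(mask)
```

The nested list comprehension is only there to make the element-wise nature of the lookup explicit; it is not meant as a fast solution.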

One solution is to use numpy.vectorize instead of using in directly; it calls an ordinary Python function on each element of the array.

It's a kind of map() for numpy arrays.

numpy.vectorize

Define a vectorized function which takes a nested sequence of objects or numpy arrays as inputs and returns a numpy array as output. The vectorized function evaluates pyfunc over successive tuples of the input arrays like the python map function, except it uses the broadcasting rules of numpy.

>>> import numpy
>>> data = numpy.random.randint(0, 10, (3, 3))
>>> test = set(numpy.random.randint(0, 10, 5))
>>> numpy.vectorize(test.__contains__)(data)

array([[False, False,  True],
       [ True,  True, False],
       [ True, False,  True]], dtype=bool)
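A slightly more explicit variant of the call above, as a sketch with fixed illustrative values instead of random ones: passing otypes tells numpy.vectorize the output type up front, so it does not have to call the function on the first element to infer it.

```python
import numpy as np

data = np.array([[1, 5, 8],
                 [3, 3, 0],
                 [9, 2, 7]])
test = {3, 7, 8, 9}

# Wrap the set's membership test once and reuse it; otypes fixes the
# result dtype to bool.
is_member = np.vectorize(test.__contains__, otypes=[bool])

mask = is_member(data)
print(mask.dtype, mask.shape)
print(mask)
```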

Benchmarks

This approach is fast when n is large, since set.__contains__() is a constant-time operation ("large" means that top > 13000 or so).

>>> import numpy as np
>>> nr, nc = 100, 100
>>> top = 300000
>>> data = np.random.randint(0, top, (nr, nc))
>>> test = set(np.random.randint(0, top, top//3))
>>> %timeit -n10 np.in1d(data, list(test)).reshape(data.shape)
10 loops, best of 3: 26.2 ms per loop

>>> %timeit -n10 np.in1d(data, np.hstack(test)).reshape(data.shape)
10 loops, best of 3: 374 ms per loop

>>> %timeit -n10 np.vectorize(test.__contains__)(data)
10 loops, best of 3: 3.16 ms per loop

However, when n is small, the other solutions are significantly faster.
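To repeat the comparison outside IPython, the standard timeit module can be used. The script below is a rough sketch under assumptions: the function names are made up, np.isin stands in for np.in1d(...).reshape(...), and the numbers it prints depend on the machine and the NumPy version, so no timings are quoted here.

```python
import timeit

import numpy as np

rng = np.random.default_rng(0)
nr, nc, top = 100, 100, 300000
data = rng.integers(0, top, size=(nr, nc))
test = set(rng.integers(0, top, size=top // 3).tolist())


def with_isin():
    # Convert the set to an array on every call, as the timed expressions above do.
    values = np.fromiter(test, dtype=np.int64, count=len(test))
    return np.isin(data, values)


def with_vectorize():
    # One Python-level set lookup per element of data.
    return np.vectorize(test.__contains__, otypes=[bool])(data)


# Make sure both approaches agree before comparing their speed.
assert np.array_equal(with_isin(), with_vectorize())

timings = {}
for func in (with_isin, with_vectorize):
    # Best of 3 repeats, 10 calls each, reported per call.
    best = min(timeit.repeat(func, number=10, repeat=3)) / 10
    timings[func.__name__] = best
    print(f"{func.__name__}: {best * 1e3:.2f} ms per call")
```

Changing top (and with it the size of the set) is the simplest way to look for the crossover point mentioned above.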

Håken Lid answered Sep 19 '22 02:09