Suppose I have the following two arrays:
a = array([(1, 'L', 74.423088306605), (5, 'H', 128.05441039929008),
(2, 'L', 68.0581377353869), (0, 'H', 88.15726964130869),
(4, 'L', 97.4501582588212), (3, 'H', 92.98550136344437),
(7, 'L', 87.75945631669309), (6, 'L', 90.43196739694255),
(8, 'H', 111.13662092749307), (15, 'H', 91.44444608631304),
(10, 'L', 85.43615908319185), (11, 'L', 78.11685661303494),
(13, 'H', 108.2841293816308), (17, 'L', 74.43917911042259),
(14, 'H', 64.41057325770373), (9, 'L', 27.407214746467943),
(16, 'H', 81.50506434964355), (12, 'H', 97.79700070323196),
(19, 'L', 51.139258140713025), (18, 'H', 118.34835768605957)],
dtype=[('id', '<i4'), ('name', 'S1'), ('value', '<f8')])
b = array([ 0, 3, 5, 8, 12, 13, 14, 15, 16, 18], dtype=int32)
I want to select the elements from a whose id is given in b. That is, b is not an index array; it contains the ids of the observations. How can I do this in NumPy?
Thanks for the help.
You should get what you want with this:
indices = [i for i, id in enumerate(a['id']) if id in b]
suba = a[indices]
print(suba)
>>>array([(5, 'H', 128.05441039929008), (0, 'H', 88.15726964130869),
(3, 'H', 92.98550136344437), (8, 'H', 111.13662092749307),
(15, 'H', 91.44444608631304), (13, 'H', 108.2841293816308),
(14, 'H', 64.41057325770373), (16, 'H', 81.50506434964355),
(12, 'H', 97.79700070323196), (18, 'H', 118.34835768605957)],
dtype=[('id', '<i4'), ('name', '|S1'), ('value', '<f8')])
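For completeness (this is not part of either answer above): NumPy also ships a direct membership test, np.in1d, nowadays exposed as np.isin, which builds the boolean mask in one call. A minimal sketch with a trimmed-down version of the arrays from the question:

```python
import numpy as np

# Shortened stand-ins for the question's structured array a and id list b
a = np.array([(1, 'L', 74.4), (5, 'H', 128.1), (0, 'H', 88.2), (3, 'H', 93.0)],
             dtype=[('id', '<i4'), ('name', 'S1'), ('value', '<f8')])
b = np.array([0, 3, 5], dtype=np.int32)

# np.isin returns a boolean mask: True where a['id'] occurs anywhere in b
mask = np.isin(a['id'], b)
suba = a[mask]
print(suba['id'])  # ids that appear in b, kept in a's original order
```

Unlike the argmax trick below, this keeps the elements in a's original order and simply drops ids that are absent, rather than requiring every id in b to be present in a.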
The following works several times faster than Francesco's approach for your sample array:
In [7]: a[np.argmax(a['id'][None, :] == b[:, None], axis=1)]
Out[7]:
array([(0, 'H', 88.15726964130869), (3, 'H', 92.98550136344437),
(5, 'H', 128.05441039929008), (8, 'H', 111.13662092749307),
(12, 'H', 97.79700070323196), (13, 'H', 108.2841293816308),
(14, 'H', 64.41057325770373), (15, 'H', 91.44444608631304),
(16, 'H', 81.50506434964355), (18, 'H', 118.34835768605957)],
dtype=[('id', '<i4'), ('name', '|S1'), ('value', '<f8')])
In [8]: %timeit a[np.argmax(a['id'][None, :] == b[:, None], axis=1)]
100000 loops, best of 3: 11.6 us per loop
In [9]: %timeit indices = [i for i,id in enumerate(a['id']) if id in b]; a[indices]
10000 loops, best of 3: 66.9 us per loop
To understand how it works, take a look at this:
In [10]: a['id'][None, :] == b[:, None]
Out[10]:
array([[False, False, False, True, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False],
... # several rows removed
[False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, True]], dtype=bool)
It is an array with as many rows as there are elements in b and as many columns as there are elements in a. np.argmax then finds the position of the first True in every row, which is the index of the first appearance of the corresponding element of b in a['id'].
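A toy reconstruction of that mechanism, with the arrays shrunk for readability (ids stands in for a['id']):

```python
import numpy as np

ids = np.array([1, 5, 2, 0, 4])  # stand-in for a['id']
b = np.array([0, 5])             # the ids we want to locate

# Broadcasting: shape (1, 5) == shape (2, 1) -> (2, 5) boolean matrix,
# one row per element of b, one column per element of ids.
matches = ids[None, :] == b[:, None]

# argmax along axis=1 returns the column of the first True in each row,
# i.e. the position of each element of b within ids.
positions = np.argmax(matches, axis=1)
print(positions)  # [3 1]
```

Here b[0] == 0 first appears at index 3 of ids and b[1] == 5 at index 1, so positions is [3, 1], and ids[positions] recovers [0, 5].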
As shown above, for small arrays this beats the pure-Python approach performance-wise. But if either a or b gets too big, the size of the intermediate array of bools can cripple performance. Also, np.argmax always scans the full row; it never breaks out of the loop early, which is not a good thing if a is very long. Note too that this assumes every id in b actually occurs in a['id']: if one is missing, its row is all False and np.argmax silently returns 0, i.e. the first element of a. I did some timings in an answer to this question that uses a similar approach, and there it was still the way to go for moderately large arrays.
Francesco's approach is definitely less hacky, easier to understand, and for an array the size of your sample the performance differences are irrelevant, I must admit. But it doesn't make you feel as much like this...