Efficiently find row intersections of two 2-D numpy arrays

Tags: python, numpy

I am trying to figure out an efficient way of finding the row-wise intersections of two NumPy arrays.

The two arrays have the same shape, and no row contains duplicate values.

For example:

import numpy as np

a = np.array([[2,5,6],
              [8,2,3],
              [4,1,5],
              [1,7,9]])

b = np.array([[2,3,4],  # one element(2) in common with a[0] -> 1
              [7,4,3],  # one element(3) in common with a[1] -> 1
              [5,4,1],  # three elements(5,4,1) in common with a[2] -> 3
              [7,6,9]]) # two elements(9,7) in common with a[3] -> 2

My desired output is: np.array([1, 1, 3, 2])

It is easy to do this with a loop:

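def get_intersect1ds(a, b):
    # One count per row; the builtin int replaces np.int, which newer NumPy releases removed
    result = np.empty(a.shape[0], dtype=int)
    # range replaces the Python 2-only xrange
    for i in range(a.shape[0]):
        result[i] = len(np.intersect1d(a[i], b[i]))
    return result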

Result:

>>> get_intersect1ds(a, b)
array([1, 1, 3, 2])

But is there a more efficient way to do it?


asked Nov 01 '13 16:11

Akavall

2 Answers

If you have no duplicates within a row, you can try to replicate what np.intersect1d does under the hood (see its source code):

>>> c = np.hstack((a, b))
>>> c
array([[2, 5, 6, 2, 3, 4],
       [8, 2, 3, 7, 4, 3],
       [4, 1, 5, 5, 4, 1],
       [1, 7, 9, 7, 6, 9]])
>>> c.sort(axis=1)
>>> c
array([[2, 2, 3, 4, 5, 6],
       [2, 3, 3, 4, 7, 8],
       [1, 1, 4, 4, 5, 5],
       [1, 6, 7, 7, 9, 9]])
>>> c[:, 1:] == c[:, :-1]
array([[ True, False, False, False, False],
       [False,  True, False, False, False],
       [ True, False,  True, False,  True],
       [False, False,  True, False,  True]], dtype=bool)
>>> np.sum(c[:, 1:] == c[:, :-1], axis=1)
array([1, 1, 3, 2])
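
The steps above can be collected into a single function. This is only a minimal sketch: the name row_intersection_counts is hypothetical (not part of NumPy), and it relies on the same assumption as the session above, namely that no row of a or b contains duplicates.

```python
import numpy as np

def row_intersection_counts(a, b):
    """Count the values shared by a[i] and b[i] for every row i.

    Assumes a and b have the same number of rows and that no row
    contains duplicate values (hypothetical helper, not a NumPy API).
    """
    # hstack returns a new array, so the in-place sort leaves a and b untouched
    c = np.hstack((a, b))
    c.sort(axis=1)
    # After sorting, a value present in both rows appears as two equal neighbours
    return np.sum(c[:, 1:] == c[:, :-1], axis=1)

a = np.array([[2, 5, 6], [8, 2, 3], [4, 1, 5], [1, 7, 9]])
b = np.array([[2, 3, 4], [7, 4, 3], [5, 4, 1], [7, 6, 9]])
print(row_intersection_counts(a, b))  # [1 1 3 2]
```

If a row did contain duplicates, the equal-neighbour test would count them too, so the result would no longer be the intersection size.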


answered Sep 19 '22 00:09

Jaime

This answer might not be viable, because if the input has shape (N, M), it generates an intermediate array with size (N, M, M), but it's always fun to see what you can do with broadcasting:

In [43]: a
Out[43]: 
array([[2, 5, 6],
       [8, 2, 3],
       [4, 1, 5],
       [1, 7, 9]])

In [44]: b
Out[44]: 
array([[2, 3, 4],
       [7, 4, 3],
       [5, 4, 1],
       [7, 6, 9]])

In [45]: (np.expand_dims(a, -1) == np.expand_dims(b, 1)).sum(axis=-1).sum(axis=-1)
Out[45]: array([1, 1, 3, 2])

For large arrays, the method could be made more memory-friendly by doing the operation in batches.
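
One way the batching could look is sketched below. The function name row_intersection_counts_batched and the batch_size parameter are hypothetical, chosen for illustration; the intermediate boolean array is limited to roughly (batch_size, M, M) instead of (N, M, M).

```python
import numpy as np

def row_intersection_counts_batched(a, b, batch_size=1024):
    """Broadcasting-based row intersection counts, computed in row batches.

    Hypothetical helper: batch_size caps how many rows are compared at once.
    Assumes a and b have the same shape and no duplicates within a row.
    """
    n = a.shape[0]
    out = np.empty(n, dtype=np.intp)
    for start in range(0, n, batch_size):
        stop = start + batch_size  # slicing past the end is safe in NumPy
        # Shape (k, M, 1) == (k, 1, M) broadcasts to (k, M, M)
        eq = a[start:stop, :, None] == b[start:stop, None, :]
        # Each True marks one matching pair; sum over both trailing axes
        out[start:stop] = eq.sum(axis=(1, 2))
    return out

a = np.array([[2, 5, 6], [8, 2, 3], [4, 1, 5], [1, 7, 9]])
b = np.array([[2, 3, 4], [7, 4, 3], [5, 4, 1], [7, 6, 9]])
print(row_intersection_counts_batched(a, b, batch_size=3))  # [1 1 3 2]
```

A smaller batch_size lowers peak memory at the cost of more Python-level loop iterations, so a suitable value depends on M and on the memory available.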

answered Sep 20 '22 00:09

Warren Weckesser
