Find the set difference between two large arrays (matrices) in Python

Tags: python, set-difference, numpy

I have two large 2-d arrays and I'd like to find their set difference taking their rows as elements. In Matlab, the code for this would be setdiff(A,B,'rows'). The arrays are large enough that the obvious looping methods I could think of take too long.

asked Aug 10 '12 13:08

zss

2 Answers

This should work, but it is currently broken in NumPy 1.6.1 because mergesort is unavailable for the view being created. It works in the pre-release 1.7.0 version. It is likely the fastest approach, since creating the views doesn't copy any memory:

>>> import numpy as np
>>> a1 = np.array([[1,2,3],[4,5,6],[7,8,9]])
>>> a2 = np.array([[4,5,6],[7,8,9],[1,1,1]])
>>> a1_rows = a1.view([('', a1.dtype)] * a1.shape[1])
>>> a2_rows = a2.view([('', a2.dtype)] * a2.shape[1])
>>> np.setdiff1d(a1_rows, a2_rows).view(a1.dtype).reshape(-1, a1.shape[1])
array([[1, 2, 3]])
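
For reuse, the view trick above can be wrapped in a small helper. This is only a sketch: the name `setdiff_rows` is made up for illustration, and it assumes both inputs are 2-D with the same number of columns. `np.ascontiguousarray` is called first because viewing rows as a structured dtype needs each row to be contiguous in memory.

```python
import numpy as np

def setdiff_rows(a1, a2):
    """Rows of a1 that do not appear in a2 (hypothetical helper)."""
    # Make sure both arrays are C-contiguous and share one dtype.
    a1 = np.ascontiguousarray(a1)
    a2 = np.ascontiguousarray(a2, dtype=a1.dtype)
    ncols = a1.shape[1]

    # View each row as a single structured element (one field per column).
    row_dtype = [('', a1.dtype)] * ncols
    a1_rows = a1.view(row_dtype)
    a2_rows = a2.view(row_dtype)

    # setdiff1d now treats whole rows as elements.
    diff = np.setdiff1d(a1_rows, a2_rows)

    # Convert the structured result back to a plain 2-D array.
    return diff.view(a1.dtype).reshape(-1, ncols)
```

As with `np.setdiff1d` on 1-D input, the rows come back sorted and without duplicates.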

You can also do this in pure Python with sets, but it might be slow for large arrays:

>>> import numpy as np
>>> a1 = np.array([[1,2,3],[4,5,6],[7,8,9]])
>>> a2 = np.array([[4,5,6],[7,8,9],[1,1,1]])
>>> a1_rows = set(map(tuple, a1))
>>> a2_rows = set(map(tuple, a2))
>>> a1_rows.difference(a2_rows)
set([(1, 2, 3)])
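
If the order of the rows matters, the set-based idea can be adapted slightly. The following is a minimal sketch, with `setdiff_rows_py` as an invented name; it accepts any iterable of rows (lists or array rows) and returns a list of tuples in first-seen order.

```python
def setdiff_rows_py(a, b):
    """Rows of a not in b, de-duplicated, in order of first appearance."""
    b_rows = set(map(tuple, b))   # hashable copies of b's rows
    seen = set()                  # rows of a already emitted
    result = []
    for row in map(tuple, a):
        if row not in b_rows and row not in seen:
            seen.add(row)
            result.append(row)
    return result
```

Each lookup is a hash probe, so the work grows roughly with the total number of rows, but every row is converted to a Python tuple along the way, which is where the slowdown for large arrays comes from.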

answered Oct 20 '22 14:10

jterrace

Here is an alternative pure NumPy solution that works in 1.6.1. It creates an intermediate array of shape (len(A), len(B), ncols), which may or may not be a problem for you. It also does not depend on any speedup from sorting the arrays (as setdiff probably does).

from numpy import *

# Create some sample arrays
A = random.randint(0, 5, (10, 3))
B = random.randint(0, 5, (10, 3))

As an example, this is what I got. Note that the two arrays have exactly one row in common:

>>> A
array([[1, 0, 3],
       [0, 4, 2],
       [0, 3, 4],
       [4, 4, 2],
       [2, 0, 2],
       [4, 0, 0],
       [3, 2, 2],
       [4, 2, 3],
       [0, 2, 1],
       [2, 0, 2]])
>>> B
array([[4, 1, 3],
       [4, 3, 0],
       [0, 3, 3],
       [3, 0, 3],
       [3, 4, 0],
       [3, 2, 3],
       [3, 1, 2],
       [4, 1, 2],
       [0, 4, 2],
       [0, 0, 3]])

We look for pairs of rows whose L1 distance is zero. Broadcasting A against B gives a matrix of pairwise distances, and the positions where that matrix is zero identify the rows common to both arrays:

idx = where(abs((A[:,newaxis,:] - B)).sum(axis=2)==0)

As a check:

>>> A[idx[0]]
array([[0, 4, 2]])
>>> B[idx[1]]
array([[0, 4, 2]])
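
The check above only shows the rows the two arrays share. To get the set difference itself, the same broadcasting idea can be turned into a boolean mask over the rows of A. The sketch below is one assumed way to finish the job, not part of the original answer: it compares rows with `==` instead of summing absolute differences (which also avoids wraparound with unsigned dtypes), and the name `setdiff_rows_broadcast` is invented here.

```python
import numpy as np

def setdiff_rows_broadcast(A, B):
    """Rows of A that match no row of B (hypothetical helper)."""
    A = np.asarray(A)
    B = np.asarray(B)
    # matches[i, j] is True when row i of A equals row j of B.
    matches = (A[:, np.newaxis, :] == B[np.newaxis, :, :]).all(axis=2)
    # Keep the rows of A that have no match anywhere in B.
    keep = ~matches.any(axis=1)
    return A[keep]
```

Unlike `np.setdiff1d`, this keeps the original row order and any duplicate rows of A. The intermediate comparison array still has shape (len(A), len(B), ncols), so memory use grows with the product of the two row counts.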

answered Oct 20 '22 15:10

Hooked
