
Find the set difference between two large arrays (matrices) in Python

I have two large 2-d arrays and I'd like to find their set difference taking their rows as elements. In Matlab, the code for this would be setdiff(A,B,'rows'). The arrays are large enough that the obvious looping methods I could think of take too long.

zss asked Aug 10 '12 13:08




2 Answers

This should work, but it is currently broken in NumPy 1.6.1 because mergesort is unavailable for the view being created. It works in the pre-release 1.7.0 version. It should be the fastest approach, since the views don't copy any memory:

>>> import numpy as np
>>> a1 = np.array([[1,2,3],[4,5,6],[7,8,9]])
>>> a2 = np.array([[4,5,6],[7,8,9],[1,1,1]])
>>> a1_rows = a1.view([('', a1.dtype)] * a1.shape[1])
>>> a2_rows = a2.view([('', a2.dtype)] * a2.shape[1])
>>> np.setdiff1d(a1_rows, a2_rows).view(a1.dtype).reshape(-1, a1.shape[1])
array([[1, 2, 3]])
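
The view trick above can be wrapped in a small helper. This is only a sketch: the name setdiff_rows is made up for illustration, and it assumes both inputs are 2-D numeric arrays with the same number of columns. It converts both arrays to a common dtype and makes them C-contiguous first, because the structured view needs each row to be one unbroken block of memory.

```python
import numpy as np

def setdiff_rows(a, b):
    """Rows of `a` that do not occur in `b` (hypothetical helper name)."""
    a = np.asarray(a)
    b = np.asarray(b)
    if a.ndim != 2 or b.ndim != 2 or a.shape[1] != b.shape[1]:
        raise ValueError("expected two 2-D arrays with the same number of columns")

    # The view only compares correctly if both arrays share one dtype
    # and each row is contiguous in memory.
    common = np.result_type(a, b)
    a = np.ascontiguousarray(a, dtype=common)
    b = np.ascontiguousarray(b, dtype=common)

    # One structured element per row, with one field per column.
    row_dtype = [('', common)] * a.shape[1]
    a_rows = a.view(row_dtype).ravel()
    b_rows = b.view(row_dtype).ravel()

    # setdiff1d de-duplicates and sorts its result.
    diff = np.setdiff1d(a_rows, b_rows)

    # Turn the structured rows back into a plain 2-D array.
    return diff.view(common).reshape(-1, a.shape[1])
```

As with setdiff1d itself, duplicate rows in the first array are collapsed and the result is sorted rather than kept in the original row order.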

You can also do this in pure Python with sets of row tuples, but it might be slow:

>>> import numpy as np
>>> a1 = np.array([[1,2,3],[4,5,6],[7,8,9]])
>>> a2 = np.array([[4,5,6],[7,8,9],[1,1,1]])
>>> a1_rows = set(map(tuple, a1))
>>> a2_rows = set(map(tuple, a2))
>>> a1_rows.difference(a2_rows)
set([(1, 2, 3)])
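
If you need the result back as an array, or in the original row order, the set approach can be extended along these lines. This is a sketch under the same assumptions as above (2-D arrays, same number of columns); setdiff_rows_via_sets is a hypothetical name, not a NumPy function.

```python
import numpy as np

def setdiff_rows_via_sets(a, b):
    """Rows of `a` not present in `b`, in first-occurrence order."""
    a = np.asarray(a)
    # tolist() yields plain Python numbers, so the tuples hash and
    # compare by value.
    b_rows = set(map(tuple, np.asarray(b).tolist()))

    seen = set()
    kept = []
    for row in map(tuple, a.tolist()):
        # Skip rows found in b and rows already emitted.
        if row not in b_rows and row not in seen:
            seen.add(row)
            kept.append(row)

    # reshape keeps the column count even when nothing is left.
    return np.array(kept, dtype=a.dtype).reshape(-1, a.shape[1])
```

The loop runs in Python, so this is still slower than a vectorized approach on large inputs, but each membership test is a constant-time set lookup on average.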
jterrace answered Oct 20 '22 14:10


Here is an alternative pure NumPy solution that works in 1.6.1. It creates an intermediate array of shape (len(A), len(B), number of columns), which may or may not be a problem for you. It also gains no speedup from sorted input (as setdiff probably does).

from numpy import *
# Create some sample arrays
A = random.randint(0, 5, (10, 3))
B = random.randint(0, 5, (10, 3))

As an example, this is what I got. Note that there is one row common to both arrays:

>>> A
array([[1, 0, 3],
       [0, 4, 2],
       [0, 3, 4],
       [4, 4, 2],
       [2, 0, 2],
       [4, 0, 0],
       [3, 2, 2],
       [4, 2, 3],
       [0, 2, 1],
       [2, 0, 2]])
>>> B
array([[4, 1, 3],
       [4, 3, 0],
       [0, 3, 3],
       [3, 0, 3],
       [3, 4, 0],
       [3, 2, 3],
       [3, 1, 2],
       [4, 1, 2],
       [0, 4, 2],
       [0, 0, 3]])

We look for pairs of rows whose (L1) distance is zero. This gives a matrix of distances between every row of A and every row of B, and the positions where it is zero mark the rows common to both arrays:

idx = where(abs((A[:,newaxis,:] - B)).sum(axis=2)==0)

As a check:

>>> A[idx[0]]
array([[0, 4, 2]])
>>> B[idx[1]]
array([[0, 4, 2]])
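
The check above finds the common rows. To get the set difference the question asks for, the same distance matrix can be reduced to a boolean mask over the rows of A. This is a sketch rather than part of the original answer; the function name setdiff_rows_broadcast is made up, and it assumes signed-integer arrays with the same number of columns.

```python
import numpy as np

def setdiff_rows_broadcast(A, B):
    """Rows of A with no exact match in B (hypothetical helper name)."""
    A = np.asarray(A)
    B = np.asarray(B)

    # dist[i, j] is the L1 distance between A[i] and B[j].
    # The intermediate array has shape (len(A), len(B), ncols).
    dist = np.abs(A[:, np.newaxis, :] - B).sum(axis=2)

    # A row of A is "in B" if its distance to some row of B is zero.
    in_B = (dist == 0).any(axis=1)

    # Keep the rows of A that matched nothing in B.
    return A[~in_B]
```

Unlike setdiff1d, this keeps the original row order and any duplicate rows of A. Memory use grows with the product of the two row counts, so for very large arrays the inputs may need to be processed in chunks.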
Hooked answered Oct 20 '22 15:10