
Remove elements from one array if present in another array, keep duplicates - NumPy / Python

I have two arrays, A (length 3.8 million) and B (length 20k). For a minimal example, let's take this case:

A = np.array([1,1,2,3,3,3,4,5,6,7,8,8])
B = np.array([1,2,8])

Now I want the resulting array to be:

C = np.array([3,3,3,4,5,6,7])

That is, if any value in B is found in A, remove every occurrence of it from A; otherwise keep it.

I would like to know if there is a way to do this without a for loop, because the arrays are long and looping over them takes a long time.

Srivatsan asked Sep 20 '18 05:09



2 Answers

Using searchsorted

With B sorted, and no value in A larger than the largest value in B, we can use searchsorted -

A[B[np.searchsorted(B,A)] != A]

Per the docs, searchsorted(a,v) finds the indices into a sorted array a such that, if the corresponding elements of v were inserted before those indices, the order of a would be preserved. So, say idx = searchsorted(B,A). Indexing into B with it, B[idx], gives a mapped version of B with one entry for every element in A. Comparing this mapped version against A tells us, for every element in A, whether there is a match in B. Finally, we index into A with that mask to select the non-matching elements.
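To make the mapping concrete, here is a small walk-through on the arrays from the question. The intermediate names (idx, mapped, keep) are only for illustration and are not part of the original one-liner.

```python
import numpy as np

A = np.array([1, 1, 2, 3, 3, 3, 4, 5, 6, 7, 8, 8])
B = np.array([1, 2, 8])  # must already be sorted

# For each element of A, the position where it would be inserted into B
idx = np.searchsorted(B, A)   # [0 0 1 2 2 2 2 2 2 2 2 2]

# The element of B sitting at that position, one per element of A
mapped = B[idx]               # [1 1 2 8 8 8 8 8 8 8 8 8]

# True where the element of A has no match in B
keep = mapped != A

C = A[keep]
print(C)                      # [3 3 3 4 5 6 7]
```

An element of A that is present in B lands exactly on its match, so mapped equals A there; any other element lands on the next larger value of B and compares unequal.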

Generic case (B is not sorted) :

If B is not already sorted, as the method requires, sort it first and then use the proposed method.

Alternatively, we can use the sorter argument of searchsorted -

sidx = B.argsort()
out = A[B[sidx[np.searchsorted(B,A,sorter=sidx)]] != A]

More generic case (A has values higher than the largest value in B) :

sidx = B.argsort()
idx = np.searchsorted(B,A,sorter=sidx)
idx[idx==len(B)] = 0
out = A[B[sidx[idx]] != A]
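As a sketch, the generic version can be wrapped in a helper and tried on an unsorted B together with values of A above B's maximum. The function name remove_values is hypothetical, and the sketch assumes B is non-empty.

```python
import numpy as np

def remove_values(A, B):
    """Drop from A every element that occurs in B (B may be unsorted, must be non-empty)."""
    sidx = B.argsort()
    idx = np.searchsorted(B, A, sorter=sidx)
    # Elements larger than max(B) get position len(B), which is out of bounds.
    # They cannot equal any element of B, so pointing them at position 0 is
    # safe: the comparison below is still unequal and they are kept.
    idx[idx == len(B)] = 0
    return A[B[sidx[idx]] != A]

A = np.array([1, 1, 2, 3, 3, 3, 4, 5, 6, 7, 8, 8, 10, 12, 14])
B = np.array([8, 1, 2])  # deliberately unsorted

out = remove_values(A, B)
print(out)  # [ 3  3  3  4  5  6  7 10 12 14]
```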

Using in1d/isin

We can also use np.in1d, which is pretty straightforward (the docs should help clarify): it looks for a match in B for every element in A, and we can then use boolean indexing with the inverted mask to select the non-matching ones -

A[~np.in1d(A,B)]

Same with isin -

A[~np.isin(A,B)]

With the invert flag -

A[np.in1d(A,B,invert=True)]

A[np.isin(A,B,invert=True)]

This handles the generic case, where B is not necessarily sorted.
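For completeness, a short runnable check of the membership-mask approach on the arrays from the question, with B shuffled to show that no sorting is needed. It uses only np.isin, since, as far as I know, NumPy 2.0 onward deprecates np.in1d in favour of it.

```python
import numpy as np

A = np.array([1, 1, 2, 3, 3, 3, 4, 5, 6, 7, 8, 8])
B = np.array([8, 1, 2])  # deliberately unsorted

# True where the element of A occurs somewhere in B
mask = np.isin(A, B)

# Keep the elements that are NOT in B; both spellings give the same result
C = A[~mask]
C_inv = A[np.isin(A, B, invert=True)]

print(C)  # [3 3 3 4 5 6 7]
```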

Divakar answered Oct 05 '22 22:10


Adding to Divakar's answer above: if A contains values greater than the largest value in B, the first searchsorted expression raises an 'index out of bounds' error. See:

A = np.array([1,1,2,3,3,3,4,5,6,7,8,8,10,12,14])
B = np.array([1,2,8])

A[B[np.searchsorted(B,A)] !=  A]
>> IndexError: index 3 is out of bounds for axis 0 with size 3

This happens because np.searchsorted assigns index 3 (one past the last index of B) as the insertion position in B for the elements 10, 12 and 14 from A in this example. Indexing B[np.searchsorted(B,A)] with that position then raises the IndexError.
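The out-of-range positions can be inspected directly. This is just an illustrative check on the example arrays; the names idx, out_of_range and raised are made up for the sketch.

```python
import numpy as np

A = np.array([1, 1, 2, 3, 3, 3, 4, 5, 6, 7, 8, 8, 10, 12, 14])
B = np.array([1, 2, 8])

idx = np.searchsorted(B, A)
print(idx)  # [0 0 1 2 2 2 2 2 2 2 2 2 3 3 3]

# Positions equal to len(B) are not valid indices into B
out_of_range = idx == len(B)
print(A[out_of_range])  # [10 12 14]

# Indexing B with those positions is what triggers the error
try:
    B[idx]
    raised = False
except IndexError:
    raised = True
print(raised)  # True
```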

To circumvent that, a possible approach (which assumes A is sorted as well) is:

import numpy as np

def subset_sorted_array(A,B):
    # Assumes both A and B are sorted in ascending order
    Aa = A[np.where(A <= np.max(B))]
    Bb = (B[np.searchsorted(B,Aa)] != Aa)
    Bb = np.pad(Bb,(0,A.shape[0]-Aa.shape[0]), mode='constant', constant_values=True)
    return A[Bb]

Which works as follows:

# Take only the elements of A that do not exceed the largest value in B,
# so that searchsorted stays within the bounds of B
Aa = A[np.where(A <= np.max(B))]

# Build the filter for those elements, then pad it with 'Trues' for the
# larger elements at the end of A - I split this in two operations for
# easier reading
Bb = (B[np.searchsorted(B,Aa)] != Aa)
Bb = np.pad(Bb,(0,A.shape[0]-Aa.shape[0]), mode='constant', constant_values=True)

# Then you can filter A by Bb
A[Bb]
# For the input arrays above:
>> array([ 3,  3,  3,  4,  5,  6,  7, 10, 12, 14])

Note that this also works for arrays of strings and other types for which the comparison operator <= is defined, provided np.max supports the dtype; since B is sorted, B[-1] is an equivalent way to get its largest value.
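Because the padding appends the 'Trues' at the end of the filter, the function above relies on A being sorted. If A may be in any order, one possible variation is to write the comparison result back into a full-length mask instead of padding. This is only a sketch under the assumption that B is sorted and non-empty; the name remove_values_masked is hypothetical.

```python
import numpy as np

def remove_values_masked(A, B):
    """Drop from A every element found in B. B must be sorted and non-empty; A may be unsorted."""
    # Start by keeping everything; elements above max(B) can never match
    keep = np.ones(A.shape[0], dtype=bool)

    # Only elements not exceeding the largest value of B need a lookup
    in_range = A <= B[-1]
    A_in = A[in_range]

    # For those, keep the ones whose insertion position does not hold a match
    keep[in_range] = B[np.searchsorted(B, A_in)] != A_in
    return A[keep]

A = np.array([10, 1, 3, 8, 12, 2])  # unsorted on purpose
B = np.array([1, 2, 8])

out = remove_values_masked(A, B)
print(out)  # [10  3 12]
```

The result agrees with A[~np.isin(A, B)] on this input, which is a convenient cross-check when trying the sketch on other data.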

vmg answered Oct 05 '22 22:10