Remove elements from one array if present in another array, keep duplicates - NumPy / Python

Tags:

python

for-loop

unique

numpy

I have two arrays A (length 3.8 million) and B (length 20k). For a minimal example, let's take this case:

A = np.array([1,1,2,3,3,3,4,5,6,7,8,8])
B = np.array([1,2,8])

Now I want the resulting array to be:

C = np.array([3,3,3,4,5,6,7])

i.e. if any value in B is found in A, remove every occurrence of it from A; otherwise keep it (duplicates included).

I would like to know if there is a way to do this without a for loop, because the arrays are long and looping over them takes a long time.

asked Sep 20 '18 05:09

Srivatsan

2 Answers

Using `searchsorted`

With a sorted B whose largest value is at least as large as every value in A, we can use searchsorted -

A[B[np.searchsorted(B,A)] != A]

From the NumPy docs, searchsorted(a,v) finds the indices into a sorted array a such that, if the corresponding elements of v were inserted before those indices, the order of a would be preserved. So, if idx = searchsorted(B,A) and we index into B with it, B[idx] gives, for every element of A, the first element of B that is not smaller than it. Comparing this mapped version against A therefore tells us, for every element of A, whether it has a match in B. Finally, we index into A with that comparison to select the non-matching elements.
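To make the mapping concrete, here is a small sketch that prints the intermediate arrays for the example from the question. The names idx, mapped and mask are only illustrative and do not come from the original answer.

```python
import numpy as np

A = np.array([1, 1, 2, 3, 3, 3, 4, 5, 6, 7, 8, 8])
B = np.array([1, 2, 8])  # assumed sorted, with max(B) >= max(A)

# Insertion position in B for each element of A (side='left' by default)
idx = np.searchsorted(B, A)

# Value of B at that position: the first element of B that is >= A[i]
mapped = B[idx]

# True where A[i] has no equal partner in B
mask = mapped != A

C = A[mask]
print(idx)     # [0 0 1 2 2 2 2 2 2 2 2 2]
print(mapped)  # [1 1 2 8 8 8 8 8 8 8 8 8]
print(C)       # [3 3 3 4 5 6 7]
```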

Generic case (B is not sorted) :

If B is not already sorted, which the method above requires, sort it first and then apply the same expression.

Alternatively, we can pass the sorter argument to searchsorted -

sidx = B.argsort()
out = A[B[sidx[np.searchsorted(B,A,sorter=sidx)]] != A]

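More generic case (A has values higher than the largest one in B) : searchsorted returns len(B) for such values, which is out of bounds for B, so we redirect those indices to 0. The smallest value of B can never equal them, so they are kept -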

sidx = B.argsort()
idx = np.searchsorted(B,A,sorter=sidx)
idx[idx==len(B)] = 0
out = A[B[sidx[idx]] != A]
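As a sketch, the snippet above can be wrapped in a reusable function. The name remove_values and the guard for an empty B are assumptions added for illustration; the core lines are the ones from the answer.

```python
import numpy as np

def remove_values(A, B):
    """Return the elements of A that do not occur in B (hypothetical helper)."""
    A = np.asarray(A)
    B = np.asarray(B)
    if B.size == 0:
        return A.copy()  # nothing to remove; also avoids indexing an empty B
    sidx = B.argsort()                         # order that would sort B
    idx = np.searchsorted(B, A, sorter=sidx)   # positions in the sorted view of B
    idx[idx == len(B)] = 0                     # values above max(B): point at min(B)
    return A[B[sidx[idx]] != A]

A = np.array([1, 1, 2, 3, 3, 3, 4, 5, 6, 7, 8, 8, 10, 12, 14])
B = np.array([8, 1, 2])  # deliberately unsorted
out = remove_values(A, B)
print(out)  # [ 3  3  3  4  5  6  7 10 12 14]
```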

Using `in1d/isin`

We can also use np.in1d, which is pretty straightforward: it looks for a match in B for every element of A and returns a boolean mask, and we can then use boolean indexing with the inverted mask to select the non-matching elements -

A[~np.in1d(A,B)]

Same with isin, which is the preferred spelling in current NumPy (np.in1d is deprecated as of NumPy 2.0) -

A[~np.isin(A,B)]

With invert flag -

A[np.in1d(A,B,invert=True)]

A[np.isin(A,B,invert=True)]

This handles the generic case where B is not necessarily sorted.
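One way to gain confidence that the two families of solutions agree is to compare them on random data. The sketch below is only a sanity check under assumed sizes and value ranges (much smaller than the 3.8 million and 20k in the question); the seed and the ranges are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.integers(0, 50_000, size=200_000)  # unsorted, with duplicates
B = rng.integers(0, 40_000, size=2_000)    # unsorted, narrower range than A

# Membership-based filter
out_isin = A[~np.isin(A, B)]

# searchsorted-based filter, including the out-of-bounds fix
sidx = B.argsort()
idx = np.searchsorted(B, A, sorter=sidx)
idx[idx == len(B)] = 0
out_sorted = A[B[sidx[idx]] != A]

print(np.array_equal(out_isin, out_sorted))  # True
```

Both filters keep the original order of A, so the results can be compared element by element.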

answered Oct 05 '22 22:10

Divakar

Adding to Divakar's answer above: if the original array A has a wider range than B, the first searchsorted expression will give you an 'index out of bounds' error. See:

A = np.array([1,1,2,3,3,3,4,5,6,7,8,8,10,12,14])
B = np.array([1,2,8])

A[B[np.searchsorted(B,A)] != A]
>> IndexError: index 3 is out of bounds for axis 0 with size 3

This happens because np.searchsorted returns index 3 (one past the last valid index of B) as the insertion position for the elements 10, 12 and 14 of A. Indexing B with those positions, B[np.searchsorted(B,A)], then raises the IndexError.
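The out-of-bounds positions are easy to see in isolation. In this small sketch, probe and pos are illustrative names, not part of the answer:

```python
import numpy as np

B = np.array([1, 2, 8])
probe = np.array([7, 8, 10, 12, 14])

# Values greater than max(B) get the insertion position len(B)
pos = np.searchsorted(B, probe)
print(pos)           # [2 2 3 3 3]
print(pos < len(B))  # [ True  True False False False]
```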

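To circumvent that, a possible approach (assuming both A and B are sorted) is:

def subset_sorted_array(A,B):
    # Elements of A that do not exceed max(B) are safe to look up in B
    Aa = A[A <= np.max(B)]
    Bb = (B[np.searchsorted(B,Aa)] != Aa)
    # The remaining tail of A is larger than max(B), so keep it: pad with True
    Bb = np.pad(Bb,(0,A.shape[0]-Aa.shape[0]), mode='constant', constant_values=True)
    return A[Bb]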

Which works as follows:

# Take only the elements of A that do not exceed max(B), so that
# searchsorted never returns an out-of-bounds index
Aa = A[A <= np.max(B)]

# Build the filter, then pad it with True for the dropped tail of A
# (split into two operations for easier reading)
Bb = (B[np.searchsorted(B,Aa)] != Aa)
Bb = np.pad(Bb,(0,A.shape[0]-Aa.shape[0]), mode='constant', constant_values=True)

# Then you can filter A by Bb
A[Bb]
# For the input arrays above:
>> array([ 3,  3,  3,  4,  5,  6,  7, 10, 12, 14])

Note that this also works for arrays of strings and other types, as long as the <= comparison is defined for them.
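Padding the filter at the end only lines up with A when the elements larger than max(B) sit at the end of A, i.e. when A is sorted. If that cannot be guaranteed, one possible variation is to write the comparison result back through a boolean mask instead of padding. This is a sketch with a hypothetical function name, and it still assumes B is sorted and non-empty:

```python
import numpy as np

def subset_any_order(A, B_sorted):
    """Hypothetical variant: A in any order, B_sorted sorted and non-empty."""
    keep = np.ones(A.shape, dtype=bool)   # elements above max(B) stay True
    in_range = A <= B_sorted[-1]          # B_sorted[-1] is max(B)
    cand = A[in_range]
    # Only in-range elements are looked up, so the index is always valid
    keep[in_range] = B_sorted[np.searchsorted(B_sorted, cand)] != cand
    return A[keep]

A = np.array([10, 1, 3, 14, 8, 2, 3])
B = np.array([1, 2, 8])
result = subset_any_order(A, B)
print(result)  # [10  3 14  3]
```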

answered Oct 05 '22 22:10

vmg
