
Determining duplicate values in an array

Suppose I have an array

import numpy as np

a = np.array([1, 2, 1, 3, 3, 3, 0])

How can I (efficiently, Pythonically) find which elements of a are duplicates (i.e., non-unique values)? In this case the result would be array([1, 3, 3]), or perhaps array([1, 3]) if that's more efficient.

I've come up with a few methods that appear to work:

Masking

m = np.zeros_like(a, dtype=bool)
m[np.unique(a, return_index=True)[1]] = True
a[~m]

Set operations

a[~np.in1d(np.arange(len(a)), np.unique(a, return_index=True)[1], assume_unique=True)] 

This one is cute but probably illegal, as assume_unique=True promises that both inputs are unique and a isn't:

np.setxor1d(a, np.unique(a), assume_unique=True) 

Histograms

u, i = np.unique(a, return_inverse=True)
u[np.bincount(i) > 1]

Sorting

s = np.sort(a, axis=None)
s[:-1][s[1:] == s[:-1]]
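
On the example array this yields the duplicates with multiplicity, i.e. the array([1, 3, 3]) form described above:

>>> s = np.sort(a, axis=None)
>>> s[:-1][s[1:] == s[:-1]]
array([1, 3, 3])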

Pandas

import pandas as pd

s = pd.Series(a)
s[s.duplicated()]

Is there anything I've missed? I'm not necessarily looking for a numpy-only solution, but it has to work with numpy data types and be efficient on medium-sized data sets (up to 10 million elements).


Conclusions

Testing with a data set of 10 million elements (on a 2.8 GHz Xeon):

a = np.random.randint(10**7, size=10**7) 

Timings, fastest first:

- sorting: 1.1 s
- the dubious setxor1d: 2.6 s
- masking and Pandas Series.duplicated: 3.1 s
- bincount: 5.6 s
- in1d and senderle's setdiff1d: 7.3 s
- Steven's Counter: 10.5 s
- Burhan's Counter.most_common: 110 s
- DSM's Counter subtraction: 360 s
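
The post doesn't show its timing harness. As a minimal sketch of how such numbers might be gathered (the function names and the use of timeit are my assumptions, not from the post):

import timeit

import numpy as np

a = np.random.randint(10**7, size=10**7)

def dup_sort(a):
    # sorting approach: each element equal to its predecessor is a duplicate
    s = np.sort(a, axis=None)
    return s[:-1][s[1:] == s[:-1]]

def dup_bincount(a):
    # histogram approach: keep unique values whose count exceeds one
    u, i = np.unique(a, return_inverse=True)
    return u[np.bincount(i) > 1]

for f in (dup_sort, dup_bincount):
    t = timeit.timeit(lambda: f(a), number=3) / 3
    print(f.__name__, round(t, 2), "s")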

I'm going to use sorting for performance, but I'm accepting Steven's answer because its performance is acceptable and it feels clearer and more Pythonic.

Edit: I've since discovered the Pandas solution; if Pandas is available, it's clear and it performs well.

asked Jul 17 '12 by ecatmur



2 Answers

As of numpy version 1.9.0, np.unique has an argument return_counts, which greatly simplifies your task:

u, c = np.unique(a, return_counts=True)
dup = u[c > 1]
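
On the question's example array this picks out each value that appears more than once:

>>> a = np.array([1, 2, 1, 3, 3, 3, 0])
>>> u, c = np.unique(a, return_counts=True)
>>> u[c > 1]
array([1, 3])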

This is similar to using Counter, except you get a pair of arrays instead of a mapping. I'd be curious to see how they perform relative to each other.

It's probably worth mentioning that even though np.unique is quite fast in practice due to its numpyness, it has worse algorithmic complexity than the Counter solution: np.unique is sort-based, so it runs in O(n log n) time asymptotically, while Counter is hash-based, with expected O(n) complexity. This won't matter much for anything but the largest datasets.
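
To get a feel for how they perform relative to each other, here's a rough sketch of a comparison (the sizes and harness are illustrative assumptions, not part of the answer):

import timeit
from collections import Counter

import numpy as np

for n in (10**5, 10**6, 10**7):
    a = np.random.randint(n, size=n)

    def via_unique():
        # sort-based: O(n log n), but vectorized
        u, c = np.unique(a, return_counts=True)
        return u[c > 1]

    def via_counter():
        # hash-based: expected O(n), but with per-element Python overhead
        return [item for item, count in Counter(a).items() if count > 1]

    for f in (via_unique, via_counter):
        t = timeit.timeit(f, number=3) / 3
        print(n, f.__name__, round(t, 3), "s")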

answered by Mad Physicist


I think this is clearest done outside of numpy. You'll have to time it against your numpy solutions if you're concerned with speed.

>>> import numpy as np
>>> from collections import Counter
>>> a = np.array([1, 2, 1, 3, 3, 3, 0])
>>> [item for item, count in Counter(a).items() if count > 1]
[1, 3]

Note: this is similar to Burhan Khalid's answer, but using items() avoids subscripting the Counter inside the condition, which should be faster.
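
For contrast, a subscripting variant (a hypothetical reconstruction for illustration, not necessarily Burhan Khalid's exact code) pays an extra Counter lookup inside the condition:

>>> c = Counter(a)
>>> [item for item in c if c[item] > 1]
[1, 3]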

answered by Steven Rumbalski