Is there a built-in that removes duplicates from a list in Python, whilst preserving order? I know that I can use a set to remove duplicates, but that destroys the original order. I also know that I can roll my own like this:
def uniq(input):
    output = []
    for x in input:
        if x not in output:
            output.append(x)
    return output
(Thanks to unwind for that code sample.)
But I'd like to avail myself of a built-in or a more Pythonic idiom if possible.
Related question: In Python, what is the fastest algorithm for removing duplicates from a list so that all elements are unique while preserving order?
sort() does not remove duplicates.
Here are some alternatives: http://www.peterbe.com/plog/uniqifiers-benchmark
Fastest one:
def f7(seq):
    seen = set()
    seen_add = seen.add
    return [x for x in seq if not (x in seen or seen_add(x))]
Why assign seen.add to seen_add instead of just calling seen.add? Python is a dynamic language, and resolving seen.add on each iteration is more costly than resolving a local variable. seen.add could have changed between iterations, and the runtime isn't smart enough to rule that out. To play it safe, it has to look up the attribute on the object each time.
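If you want to see the effect yourself, a rough timeit sketch comparing the repeated attribute lookup with the pre-bound local might look like this (the absolute numbers depend on your interpreter and hardware, and the gap only matters in very hot loops):

import timeit

setup = "seen = set(); seen_add = seen.add"
# attribute lookup on every call
print(timeit.timeit("seen.add(1)", setup=setup))
# pre-bound local, skips the per-call attribute resolution
print(timeit.timeit("seen_add(1)", setup=setup))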
If you plan on using this function a lot on the same dataset, perhaps you would be better off with an ordered set: http://code.activestate.com/recipes/528878/ — O(1) insertion, deletion, and member-check per operation.
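If you can't pull in that recipe, a minimal sketch of the idea (this is not the linked recipe, just an illustration assuming Python 3.7+ insertion-ordered dicts) could be built on a plain dict:

class OrderedSet:
    """Minimal ordered-set sketch: a dict whose keys are the members."""

    def __init__(self, iterable=()):
        self._data = dict.fromkeys(iterable)

    def add(self, item):
        self._data[item] = None        # O(1) average insertion

    def discard(self, item):
        self._data.pop(item, None)     # O(1) average deletion

    def __contains__(self, item):
        return item in self._data      # O(1) average member-check

    def __iter__(self):
        return iter(self._data)        # iterates in insertion order

With that sketch, list(OrderedSet([1, 2, 0, 1, 3, 2])) gives [1, 2, 0, 3].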
(Small additional note: seen.add() always returns None, so the or above is there only as a way to attempt a set update, and not as an integral part of the logical test.)
The best solution varies by Python version and environment constraints:
First introduced in PyPy 2.5.0, and adopted in CPython 3.6 as an implementation detail before being made a language guarantee in Python 3.7, plain dict is insertion-ordered, and even more efficient than collections.OrderedDict (which has also been C-implemented since CPython 3.5). So the fastest solution, by far, is also the simplest:
>>> items = [1, 2, 0, 1, 3, 2]
>>> list(dict.fromkeys(items))  # Or [*dict.fromkeys(items)] if you prefer
[1, 2, 0, 3]
Like list(set(items)), this pushes all the work to the C layer (on CPython), but since dicts are insertion-ordered, dict.fromkeys doesn't lose ordering. It's slower than list(set(items)) (typically taking 50-100% longer), but much faster than any other order-preserving solution (taking about half the time of hacks involving use of a set in a listcomp).
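If you want to verify those relative costs on your own interpreter, a rough timeit sketch (the input and repeat count here are just illustrative, and absolute numbers will vary) might look like:

import timeit

setup = "items = list(range(1000)) * 2"
# not order-preserving; shown only as the baseline
print(timeit.timeit("list(set(items))", setup=setup, number=1_000))
# order-preserving, all in the C layer
print(timeit.timeit("list(dict.fromkeys(items))", setup=setup, number=1_000))
# order-preserving set-in-a-listcomp hack
print(timeit.timeit(
    "seen = set(); [x for x in items if not (x in seen or seen.add(x))]",
    setup=setup, number=1_000))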
Important note: the unique_everseen solution from more_itertools (see below) has some unique advantages in terms of laziness and support for non-hashable input items; if you need those features, it's the only solution that will work.
As Raymond pointed out, on CPython 3.5, where OrderedDict is implemented in C, ugly list comprehension hacks are slower than OrderedDict.fromkeys (unless you actually need the list at the end, and even then, only if the input is very short). So on both performance and readability, the best solution for CPython 3.5 is the OrderedDict equivalent of the 3.6+ use of plain dict:
>>> from collections import OrderedDict
>>> items = [1, 2, 0, 1, 3, 2]
>>> list(OrderedDict.fromkeys(items))
[1, 2, 0, 3]
On CPython 3.4 and earlier, this will be slower than some other solutions, so if profiling shows you need a better solution, keep reading.
As @abarnert notes, the more_itertools library (pip install more_itertools) contains a unique_everseen function that is built to solve this problem without any unreadable (not seen.add(x)) mutations in list comprehensions, and it's the fastest of the remaining options too:
>>> from more_itertools import unique_everseen
>>> items = [1, 2, 0, 1, 3, 2]
>>> list(unique_everseen(items))
[1, 2, 0, 3]
Just one simple library import and no hacks.
The module adapts the itertools recipe unique_everseen, which looks like:
from itertools import filterfalse

def unique_everseen(iterable, key=None):
    "List unique elements, preserving order. Remember all elements ever seen."
    # unique_everseen('AAAABBBCCDAABBB') --> A B C D
    # unique_everseen('ABBCcAD', str.lower) --> A B C D
    seen = set()
    seen_add = seen.add
    if key is None:
        for element in filterfalse(seen.__contains__, iterable):
            seen_add(element)
            yield element
    else:
        for element in iterable:
            k = key(element)
            if k not in seen:
                seen_add(k)
                yield element
but unlike the itertools recipe, it supports non-hashable items (at a performance cost: if all elements in iterable are non-hashable, the algorithm becomes O(n²), vs. O(n) if they're all hashable).
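For example (assuming more_itertools is installed), deduplicating lists, which are unhashable, still works; it just falls back to the slower equality-based path:

>>> from more_itertools import unique_everseen
>>> list(unique_everseen([[1, 2], [2, 3], [1, 2]]))
[[1, 2], [2, 3]]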
Important note: Unlike all the other solutions here, unique_everseen can be used lazily; the peak memory usage will be the same (eventually, the underlying set grows to the same size), but if you don't listify the result and just iterate it, you'll be able to process unique items as they're found, rather than waiting until the entire input has been deduplicated before processing the first unique item.
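As a small sketch of that lazy use, you can pull the first few unique items out of an effectively infinite iterator, which none of the list-building solutions could do:

>>> from itertools import count, islice
>>> from more_itertools import unique_everseen
>>> list(islice(unique_everseen(x % 3 for x in count()), 3))
[0, 1, 2]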
You have two options:

1. Copy and paste the unique_everseen recipe into your code and use it per the more_itertools example above.

2. Use ugly hacks to allow a single listcomp to both check and update a set to track what's been seen:
seen = set()
[x for x in seq if x not in seen and not seen.add(x)]
at the expense of relying on the ugly hack not seen.add(x), which relies on the fact that set.add is an in-place method that always returns None, so not None evaluates to True.
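A quick interactive check of that fact:

>>> seen = set()
>>> seen.add(1) is None    # add() mutates the set but returns None
True
>>> not seen.add(2)        # so "not seen.add(x)" is always truthy
True
>>> seen
{1, 2}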
Note that all of the solutions above are O(n) (save calling unique_everseen on an iterable of non-hashable items, which is O(n²), while the others would fail immediately with a TypeError), so all the solutions are performant enough when they're not the hottest code path. Which one to use depends on which versions of the language spec/interpreter/third-party modules you can rely on, whether or not performance is critical (don't assume it is; it usually isn't), and most importantly, readability (because if the person who maintains this code later ends up in a murderous mood, your clever micro-optimization probably wasn't worth it).