There is a list L whose elements may each be of arbitrary type. How can all duplicate elements be deleted from such a list efficiently? Order must be preserved. Only an algorithm is required, so importing an external library is not allowed.

Related questions

In Python, what is the fastest algorithm for removing duplicates from a list so that all elements are unique while preserving order?
How do you remove duplicates from a list in Python whilst preserving order?
Removing duplicates from list of lists in Python
How do you remove duplicates from a list in Python?

Algorithm - How to delete duplicate elements in a list efficiently?

2 Answers

Assuming order matters:

Create an empty set S and an empty list M.
Scan the list L one element at a time.
If the element is in the set S, skip it.
Otherwise, add it to M and to S.
Repeat for all elements in L.
Return M.

In Python:

>>> L = [2, 1, 4, 3, 5, 1, 2, 1, 1, 6, 5]
>>> S = set()
>>> M = []
>>> for e in L:
...     if e in S:
...         continue
...     S.add(e)
...     M.append(e)
... 
>>> M
[2, 1, 4, 3, 5, 6]

If order does not matter:

M = list(set(L))
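
Both snippets above rely on every element being hashable, which the question's "arbitrary type" does not guarantee (a Python list, for example, cannot be put in a set). A rough sketch of one way to cope is below: it keeps the set for hashable elements and falls back to a plain list scan for the rest. The function name unique_in_order is made up for illustration, and the sketch assumes an unhashable element never compares equal to a hashable one.

```python
def unique_in_order(items):
    """Return items without duplicates, keeping the first occurrence of each.

    Hypothetical helper: hashable elements are tracked in a set (fast),
    unhashable ones in a list (linear scan per lookup).
    """
    seen_hashable = set()
    seen_unhashable = []
    result = []
    for item in items:
        try:
            if item in seen_hashable:
                continue
            seen_hashable.add(item)
        except TypeError:
            # Raised for unhashable elements such as lists or dicts,
            # so fall back to comparing with == against earlier ones.
            if item in seen_unhashable:
                continue
            seen_unhashable.append(item)
        result.append(item)
    return result


print(unique_in_order([[1], 2, [1], 2, "a", [3], "a"]))
```

With only hashable elements this stays O(n) on average; every unhashable element costs a scan of the unhashable ones seen so far, so the worst case is O(n²).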

answered Oct 21 '22 08:10

FogleBird

Special Case: Hashing and Equality

First, we need to pin down an assumption: that equality and a hash function are related in a consistent way. What do I mean by this? I mean that for the set of source objects S, given any two objects x1 and x2 that are elements of S, there exists a (hash) function F such that:

if (x1.equals(x2)) then F(x1) == F(x2)

Java has such a relationship. That allows you to check for duplicates as a near O(1) operation and thus reduces the algorithm to a simple O(n) problem. If order is unimportant, it's a simple one-liner:

List<Object> result = new ArrayList<>(new HashSet<>(inputList));

If order is important:

List<Object> outputList = new ArrayList<>();
Set<Object> seen = new HashSet<>();
for (Object item : inputList) {
  // add() returns false if the item was already in the set
  if (seen.add(item)) {
    outputList.add(item);
  }
}

You will note that I said "near O(1)". That's because such data structures (such as a Java HashMap or HashSet) use a portion of the hash code to select a slot (often called a bucket) in the backing storage. The number of buckets is a power of 2, so the index into the backing array is easy to calculate. hashCode() returns an int. If you have 16 buckets, you can find which one to use by ANDing the hashCode with 15, giving you a number from 0 to 15.
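
The masking step can be sketched in a few lines of Python. This only illustrates the arithmetic described above; bucket_index is a made-up name, and a real hash table may mix the bits of the hash code before masking.

```python
def bucket_index(hash_code, bucket_count):
    """Map a hash code to a bucket, assuming bucket_count is a power of 2."""
    if bucket_count <= 0 or (bucket_count & (bucket_count - 1)) != 0:
        raise ValueError("bucket_count must be a power of 2")
    # For a power of 2, (bucket_count - 1) has all low bits set, so the
    # AND keeps only the low bits of the hash code.
    return hash_code & (bucket_count - 1)


print(bucket_index(37, 16))          # 37 is 0b100101, low four bits are 0b0101
print(bucket_index(0x1234ABCD, 16))  # low four bits are 0xD
```

For non-negative hash codes the mask gives the same answer as hash_code % bucket_count, but without a division.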

When you try to put something in that bucket, it may already be occupied. If so, a linear comparison of all entries in that bucket occurs. If the collision rate gets too high, or you put too many elements in, the structure is grown, typically doubled (but always to a power-of-2 size), and all the items are placed in their new buckets (based on the new mask). Thus resizing such structures is relatively expensive.
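
To make the bucket scan and the doubling step concrete, here is a toy chained hash set in Python. It is a teaching sketch only, not how HashSet or Python's set is actually implemented; the class name ToyHashSet, the starting size of 16 and the 0.75 load threshold are all choices made for this example.

```python
class ToyHashSet:
    """Minimal chained hash set: a list of buckets, each bucket a list."""

    def __init__(self, bucket_count=16, max_load=0.75):
        # bucket_count is assumed to be a power of 2 so masking works.
        self._buckets = [[] for _ in range(bucket_count)]
        self._size = 0
        self._max_load = max_load

    @staticmethod
    def _bucket_for(item, buckets):
        # Mask the hash down to a valid index into the bucket list.
        return buckets[hash(item) & (len(buckets) - 1)]

    def __contains__(self, item):
        # Linear comparison against every entry in the one bucket.
        return any(item == other for other in self._bucket_for(item, self._buckets))

    def add(self, item):
        """Insert item; return False if an equal item was already present."""
        bucket = self._bucket_for(item, self._buckets)
        if any(item == other for other in bucket):
            return False
        bucket.append(item)
        self._size += 1
        if self._size > self._max_load * len(self._buckets):
            self._grow()
        return True

    def _grow(self):
        # Double the bucket count and re-place every item under the new mask.
        new_buckets = [[] for _ in range(2 * len(self._buckets))]
        for bucket in self._buckets:
            for item in bucket:
                self._bucket_for(item, new_buckets).append(item)
        self._buckets = new_buckets


s = ToyHashSet()
print([x for x in [2, 1, 4, 3, 5, 1, 2, 1, 1, 6, 5] if s.add(x)])
```

The _grow method touches every stored item, which is why a resize is expensive compared with a single insertion.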

Lookup may also be expensive. Consider this class:

public class A {
  private final int a;

  A(int a) { this.a = a; }

  @Override
  public boolean equals(Object ob) {
    if (ob == null || ob.getClass() != getClass()) return false;
    A other = (A) ob;
    return other.a == a;
  }

  // Legal but pathological: every instance lands in the same bucket.
  @Override
  public int hashCode() { return 7; }
}

This code is perfectly legal and it fulfills the equals-hashCode contract.

Assuming your set contains nothing but A instances, each insertion or search now becomes an O(n) operation, making the whole de-duplication O(n²).

Obviously this is an extreme example but it's useful to point out that such mechanisms also rely on a relatively good distribution of hashes within the value space the map or set uses.
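
The same effect is easy to observe in Python by counting equality checks. The sketch below mirrors the class above; ConstantHash and count_comparisons are names invented for this example, and the exact counts depend on the set implementation, so only the growth trend matters.

```python
class ConstantHash:
    """Equality depends on the value, but every instance has the same hash."""

    comparisons = 0  # class-wide count of __eq__ calls

    def __init__(self, a):
        self.a = a

    def __eq__(self, other):
        ConstantHash.comparisons += 1
        return type(other) is type(self) and other.a == self.a

    def __hash__(self):
        return 7


def count_comparisons(n):
    """Insert n distinct ConstantHash objects into a set; return __eq__ calls."""
    ConstantHash.comparisons = 0
    seen = set()
    for i in range(n):
        seen.add(ConstantHash(i))
    return ConstantHash.comparisons


for n in (100, 200, 400):
    print(n, count_comparisons(n))
```

Doubling n should roughly quadruple the number of comparisons, which is the O(n²) behavior described above.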

Finally, it must be said that this is a special case. If you're using a language without this kind of "hashing shortcut" then it's a different story.

General Case: No Ordering

If no ordering function exists for the list (and no usable hash function either), then you're stuck with an O(n²) brute-force comparison of every object to every other object. So in Java:

List result = new ArrayList();
for (Object item : inputList) {
  boolean duplicate = false;
  for (Object ob : result) {
    if (ob.equals(item)) {
      duplicate = true;
      break;
    }
  }
  if (!duplicate) {
    result.add(item);
  }
}

General Case: Ordering

If an ordering function exists (as it does with, say, a list of integers or strings) then you sort the list (which is O(n log n)) and then compare each element in the list to the next (O(n)), so the total algorithm is O(n log n). Note that this sorts inputList in place, so the result comes out in sorted order rather than in the original order. In Java:

Collections.sort(inputList);
List result = new ArrayList();
Object prev = null;
for (Object item : inputList) {
  if (!item.equals(prev)) {
    result.add(item);
  }
  prev = item;
}

Note: the above examples assume no nulls are in the list.
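
Since the question asks for the original order to be preserved, one possible variation is to sort positions instead of the elements themselves and restore the original order at the end. The Python sketch below shows the idea under the assumption that the elements are mutually comparable with < and ==; dedupe_by_sorting is a name made up for this example.

```python
def dedupe_by_sorting(items):
    """Remove duplicates using only an ordering, keeping first occurrences
    in their original order. Runs in O(n log n) and leaves items untouched."""
    # Sort the positions by element value. sorted() is stable, so equal
    # elements stay in order of their original position.
    order = sorted(range(len(items)), key=lambda i: items[i])
    kept = []
    for pos, i in enumerate(order):
        # The first position in each run of equal values is the earliest one.
        if pos == 0 or items[i] != items[order[pos - 1]]:
            kept.append(i)
    kept.sort()  # back to original order
    return [items[i] for i in kept]


print(dedupe_by_sorting([2, 1, 4, 3, 5, 1, 2, 1, 1, 6, 5]))
```

The extra sort of the kept positions does not change the overall O(n log n) bound.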

answered Oct 21 '22 06:10

cletus

Tags: java, c++, python, algorithm, haskell

psihodelia
