 

How to remove all duplicates from a list?

Consider this test app:

function RemoveDuplicates(const Input: IEnumerable<Integer>): IEnumerable<Integer>;
begin
  // How to implement this function?
end;

var
  Enumerable: IEnumerable<Integer>;
  UniqueEnumerable: IEnumerable<Integer>;
begin
  Enumerable := TCollections.CreateList<Integer>([1, 1, 2, 3, 3, 3, 4]);
  UniqueEnumerable := RemoveDuplicates(Enumerable);
  UniqueEnumerable.ForEach(
    procedure(const I: Integer)
    begin
      WriteLn(I);
    end);
  ReadLn;
end.

How can I implement the RemoveDuplicates function (this is called nub in Haskell)?

Jens Mühlenhoff asked Sep 03 '15 12:09

2 Answers

Use what is already there:

uses
  Spring.Collections,
  Spring.Collections.Extensions;

function RemoveDuplicates(const Input: IEnumerable<Integer>): IEnumerable<Integer>;
begin
  Result := TDistinctIterator<Integer>.Create(Input, nil);
end;

This supports lazy evaluation (meaning Input is not processed until the resulting enumerable itself is enumerated). It uses a hash set (currently implemented as a Dictionary) internally to keep track of the items already seen (this happens inside the enumerator).

Why is that important? Because any operation that performs a complete enumeration can cause an unwanted performance hit if Input involves other costly operations, which may far outweigh the benefits of other approaches to removing duplicates (such as copying to a list and sorting it). Also, an IEnumerable is not guaranteed to be finite.

If Input changes between calling this function and enumerating the result, that change affects the outcome of your enumeration, which would not be the case without lazy evaluation. If you enumerate multiple times, the outcome may differ (i.e. be up to date) each time.
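A minimal sketch of that lazy behaviour, assuming the RemoveDuplicates implementation above and the same Spring.Collections types as in the question:

var
  Source: IList<Integer>;
  Distinct: IEnumerable<Integer>;
begin
  Source := TCollections.CreateList<Integer>([1, 1, 2]);
  Distinct := RemoveDuplicates(Source); // nothing is enumerated yet
  Source.Add(2);
  Source.Add(3);
  // Enumerating now reflects the updated list: 1, 2, 3
  Distinct.ForEach(
    procedure(const I: Integer)
    begin
      WriteLn(I);
    end);
end.

An eager implementation would have captured only 1, 2 at the time of the call.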

Stefan Glienke answered Oct 23 '22 01:10


Jens's solution will work, but it has a rather slow running time of O(n²).

A better alternative if you have a long list is to
- Sort the list
- Compare every item with its successor.

This has a running time of O(n log n) for the quicksort plus O(n) for the scan, for a total running time of O(n log n).

See the following code (I don't have access to Delphi right now).

function RemoveDuplicates(const Input: IEnumerable<Integer>): IEnumerable<Integer>;
var
  List: IList<Integer>;
  i: Integer;
begin
  List := TCollections.CreateList<Integer>;
  List.AddRange(Input); // copy the input into a sortable list
  List.Sort;
  for i := List.Count - 1 downto 1 do
  begin
    if List[i] = List[i - 1] then // duplicates are adjacent after sorting
      List.Delete(i);
    //if Comparer<T>.Equals(List[i], List[i-1]) then ....
  end; {for i}
  Result := List;
end;

Problems
The problem with this approach is that the output may have a different order than the input. This may or may not be a problem.

Benefits (or why the dictionary sucks)
If sorting is a cheap operation, this will be the fastest approach.
The use of a dictionary carries a high constant cost for the hashing.
Even though the hashing operation is O(1), it can get very expensive for large keys, because the hash must always process the whole key, whereas a sorting comparison can stop as soon as a difference is detected. Note also that hashing is a much more expensive operation than a simple comparison (roughly 30x to 100x slower)!

Only when the list is huge does the better asymptotic running time of the dictionary kick in.
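For contrast, here is a sketch of the eager hash-based variant being discussed, using Spring's set type (the assumption here is that ISet<Integer>.Add returns False when the item is already present, mirroring .NET's HashSet):

function RemoveDuplicatesHashed(const Input: IEnumerable<Integer>): IEnumerable<Integer>;
var
  Seen: ISet<Integer>;
  List: IList<Integer>;
  I: Integer;
begin
  Seen := TCollections.CreateSet<Integer>;
  List := TCollections.CreateList<Integer>;
  for I in Input do
    if Seen.Add(I) then // Add returns False for items already seen
      List.Add(I);      // keeps the first occurrence, preserving input order
  Result := List;
end;

Unlike the sort-based version, this preserves the input order, at the cost of hashing every item.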

Johan answered Oct 23 '22 00:10