Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Find duplicates in an unsorted sequence efficiently

I need a very efficient way to find duplicates in an unsorted sequence. This is what I came up with, but it has a few shortcomings, namely it

  1. unnecessarily counts occurrences beyond 2
  2. consumes the entire sequence before yielding duplicates
  3. creates several intermediate sequences

module Seq = 
  let duplicates items =
    items
    |> Seq.countBy id
    |> Seq.filter (snd >> ((<) 1))
    |> Seq.map fst

Regardless of the shortcomings, I don't see a reason to replace this with twice the code. Is it possible to improve this with comparably concise code?

like image 217
Daniel Avatar asked Mar 14 '12 19:03

Daniel


People also ask

How are duplicates removed from an unsorted array without using any library?

We can remove duplicate element in an array by 2 ways: using temporary array or using separate index. To remove the duplicate element from array, the array must be in sorted order. If array is not sorted, you can sort it by calling Arrays.

How do you find repeated numbers in an array if it contains multiple duplicates?

Duplicate elements can be found using two loops. The outer loop will iterate through the array from 0 to length of the array. The outer loop will select an element. The inner loop will be used to compare the selected element with the rest of the elements of the array.

How do you find duplicates from an unsorted array in C?

To find duplicate elements in given array we need two loops. Run an outer loop from 0 to size . Loop structure must look like for(i=0; i<size; i++) . This loop is used to select each element of array and check next subsequent elements for duplicates using another nested loop.


3 Answers

A more elegant functional solution:

let duplicates xs =
  Seq.scan (fun xs x -> Set.add x xs) Set.empty xs
  |> Seq.zip xs
  |> Seq.choose (fun (x, xs) -> if Set.contains x xs then Some x else None)

Uses scan to accumulate sets of all elements seen so far. Then uses zip to combine each element with the set of elements before it. Finally, uses choose to filter out the elements that are in the set of previously-seen elements, i.e. the duplicates.

EDIT

Actually my original answer was completely wrong. Firstly, you don't want duplicates in your outputs. Secondly, you want performance.

Here is a purely functional solution that implements the algorithm you're after:

let duplicates xs =
  (Map.empty, xs)
  ||> Seq.scan (fun xs x ->
      match Map.tryFind x xs with
      | None -> Map.add x false xs
      | Some false -> Map.add x true xs
      | Some true -> xs)
  |> Seq.zip xs
  |> Seq.choose (fun (x, xs) ->
      match Map.tryFind x xs with
      | Some false -> Some x
      | None | Some true -> None)

This uses a map to track whether each element has been seen before once or many times and then emits the element if it is seen having only been seen once before, i.e. the first time it is duplicated.

Here is a faster imperative version:

let duplicates (xs: _ seq) =
  seq { let d = System.Collections.Generic.Dictionary(HashIdentity.Structural)
        let e = xs.GetEnumerator()
        while e.MoveNext() do
          let x = e.Current
          let mutable seen = false
          if d.TryGetValue(x, &seen) then
            if not seen then
              d.[x] <- true
              yield x
          else
            d.[x] <- false }

This is around 2× faster than any of your other answers (at the time of writing).

Using a for x in xs do loop to enumerate the elements in a sequence is substantially slower than using GetEnumerator directly but generating your own Enumerator is not significantly faster than using a computation expression with yield.

Note that the TryGetValue member of Dictionary allows me to avoid allocation in the inner loop by mutating a stack allocated value whereas the TryGetValue extension member offered by F# (and used by kvb in his/her answer) allocates its return tuple.

like image 97
J D Avatar answered Oct 27 '22 00:10

J D


Here's an imperative solution (which is admittedly slightly longer):

let duplicates items =
    seq {
        let d = System.Collections.Generic.Dictionary()
        for i in items do
            match d.TryGetValue(i) with
            | false,_    -> d.[i] <- false         // first observance
            | true,false -> d.[i] <- true; yield i // second observance
            | true,true  -> ()                     // already seen at least twice
    }
like image 38
kvb Avatar answered Oct 26 '22 23:10

kvb


This is the best "functional" solution I could come up with that doesn't consume the entire sequence up front.

let duplicates =
    Seq.scan (fun (out, yielded:Set<_>, seen:Set<_>) item -> 
        if yielded.Contains item then
            (None, yielded, seen)
        else
            if seen.Contains item then
                (Some(item), yielded.Add item, seen.Remove item)
            else
                (None, yielded, seen.Add item)
    ) (None, Set.empty, Set.empty)
    >> Seq.Choose (fun (x,_,_) -> x)
like image 35
gradbot Avatar answered Oct 27 '22 00:10

gradbot