Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

F#: removing duplicates from a seq is slow

Tags:

performance

f#

I am attempting to write a function that weeds out consecutive duplicates, as determined by a given equality function, from a seq<'a> but with a twist: I need the last duplicate from a run of duplicates to make it into the resulting sequence. For example, if I have a sequence [("a", 1); ("b", 2); ("b", 3); ("b", 4); ("c", 5)], and I am using fun ((x1, y1),(x2, y2)) -> x1=x2 to check for equality, the result I want to see is [("a", 1); ("b", 4); ("c", 5)]. The point of this function is that I have data points coming in, where sometimes data points legitimately have the same timestamp, but I only care about the latest one, so I want to throw out the preceding ones with the same timestamp. The function I have implemented is as follows:

let rec dedupeTakingLast equalityFn prev s = seq {
     match ( Seq.isEmpty s ) with
     | true -> match prev with 
                 | None -> yield! s
                 | Some value -> yield value
     | false ->
         match prev with 
         | None -> yield! dedupeTakingLast equalityFn (Some (Seq.head s)) (Seq.tail s) 
         | Some prevValue -> 
             if not (equalityFn(prevValue, (Seq.head s))) then 
                 yield prevValue
             yield! dedupeTakingLast equalityFn (Some (Seq.head s)) (Seq.tail s)
}

In terms of actually doing the job, it works:

> [("a", 1); ("b", 2); ("b", 3); ("b", 4); ("c", 5)] 
  |> dedupeTakingLast (fun ((x1, y1),(x2, y2)) -> x1=x2) None 
  |> List.ofSeq;;
val it : (string * int) list = [("a", 1); ("b", 4); ("c", 5)]

However, in terms of performance, it's a disaster:

> #time
List.init 1000 (fun _ -> 1) 
|> dedupeTakingLast (fun (x,y) -> x = y) None 
|> List.ofSeq
#time;;    
--> Timing now on    
Real: 00:00:09.958, CPU: 00:00:10.046, GC gen0: 54, gen1: 1, gen2: 1
val it : int list = [1]    
--> Timing now off

Clearly I'm doing something very dumb here, but I cannot see what it is. Where is the performance hit coming from? I realise that there are better implementations, but I am specifically interested in understanding why this implementation is so slow.

EDIT: FYI, managed to eke out a decent implementation in the functional style that relies on Seq. functions only. Performance is OK, taking about 1.6x the time of the imperative-style implementation by @gradbot below that uses iterators. It seems that the root of the problem is the use of Seq.head and Seq.tail in recursive calls in my original effort.

let dedupeTakingLastSeq equalityFn s = 
    s 
    |> Seq.map Some
    |> fun x -> Seq.append x [None]
    |> Seq.pairwise
    |> Seq.map (fun (x,y) -> 
            match (x,y) with 
            | (Some a, Some b) -> (if (equalityFn a b) then None else Some a)  
            | (_,None) -> x
            | _ -> None )
    |> Seq.choose id
like image 554
mt99 Avatar asked Apr 06 '16 18:04

mt99


3 Answers

The performance issue comes from the nested calls to Seq.tail. Here's the source code to Seq.tail

[<CompiledName("Tail")>]
let tail (source: seq<'T>) =
    checkNonNull "source" source
    seq { use e = source.GetEnumerator() 
          if not (e.MoveNext()) then 
              invalidArg "source" (SR.GetString(SR.notEnoughElements))
          while e.MoveNext() do
              yield e.Current }

If you call Seq.tail(Seq.tail(Seq.tail(...))) the compiler has no way of optimizing out the enumerators that are created by GetEnumerator(). Subsequent returned elements have to go through every nested sequence and enumerator. This results in every returned element having to bubble up through all previously created sequences and all of these sequences have to increment their internal state as well. The net result is a running time of O(n^2) instead of linear O(n).

Unfortunately there is currently no way to represent this in a functional style in F#. You can with a list (x::xs) but not for a sequence. Until the language gets better native support for sequences, don't use Seq.tail in recursive functions.

Using a single enumerator will fix the performance problem.

let RemoveDuplicatesKeepLast equals (items:seq<_>) =
    seq {
        use e = items.GetEnumerator()

        if e.MoveNext() then
            let mutable previous = e.Current

            while e.MoveNext() do
                if not (previous |> equals e.Current) then 
                    yield previous
                previous <- e.Current

            yield previous
    }

let test = [("a", 1); ("b", 2); ("b", 3); ("b", 4); ("c", 5)]
let FirstEqual a b = fst a = fst b

RemoveDuplicatesKeepLast FirstEqual test
|> printf "%A"

// output
// seq [("a", 1); ("b", 4); ("c", 5)]

The first version of this answer has a recursive version of the above code without mutation.

like image 156
gradbot Avatar answered Oct 23 '22 07:10

gradbot


The problem is with how you use sequences. All those yields, heads and tails are spinning a web of iterators branching off of iterators, and when you finally materialize it when you call List.ofSeq, you're iterating through your input sequence way more than you should.

Each of those Seq.heads is not simply taking the first element of a sequence - it's taking the first element of the tail of a sequence of a tail of a sequence of tail of a sequence and so on.

Check this out - it'll count the times the element constructor is called:

let count = ref 0

Seq.init 1000 (fun i -> count := !count + 1; 1) 
|> dedupeTakingLast (fun (x,y) -> x = y) None 
|> List.ofSeq

Incidentally, just switching out all the Seqs to Lists makes it go instantly.

like image 6
scrwtp Avatar answered Oct 23 '22 07:10

scrwtp


Seq.isEmpty, Seq.head and Seq.tail are slow because they all create a new Enumerator instance which it then calls into. You end up with a lot of GC.

Generally, Sequences are forward only, and if you use them 'like pattern matching for lists', the performance becomes really shoddy.

Looking a bit at your code... | None -> yield! s creates a new Enumerator even though we know s is empty. Every recursive call probably ends up creating a new IEnumerable that is then directly turned into an Enumerator from the call-site with yield!.

like image 4
Henrik Avatar answered Oct 23 '22 07:10

Henrik