Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why is using a sequence so much slower than using a list in this example

Tags:

Background: I have a sequence of contiguous, time-stamped data. The data-sequence has holes in it, some large, others just a single missing value.
Whenever the hole is just a single missing value, I want to patch the holes using a dummy-value (larger holes will be ignored).

I would like to use lazy generation of the patched sequence, and I am thus using Seq.unfold.

I have made two versions of the method to patch the holes in the data.

The first consumes the sequence of data with holes in it and produces the patched sequence. This is what i want, but the methods runs horribly slow when the number of elements in the input sequence rises above 1000, and it gets progressively worse the more elements the input sequence contains.

The second method consumes a list of the data with holes and produces the patched sequence and it runs fast. This is however not what I want, since this forces the instantiation of the entire input-list in memory.

I would like to use the (sequence -> sequence) method rather than the (list -> sequence) method, to avoid having the entire input-list in memory at the same time.

Questions:

1) Why is the first method so slow (getting progressively worse with larger input lists) (I am suspecting that it has to do with repeatedly creating new sequences with Seq.skip 1, but I am not sure)

2) How can I make the patching of holes in the data fast, while using an input sequence rather than an input list?

The code:

open System

// Method 1 (Slow)
let insertDummyValuesWhereASingleValueIsMissing1 (timeBetweenContiguousValues : TimeSpan) (values : seq<(DateTime * float)>) =
    let sizeOfHolesToPatch = timeBetweenContiguousValues.Add timeBetweenContiguousValues // Only insert dummy-values when the gap is twice the normal
    (None, values) |> Seq.unfold (fun (prevValue, restOfValues) ->  
        if restOfValues |> Seq.isEmpty then
            None // Reached the end of the input seq
        else
            let currentValue = Seq.hd restOfValues
            if prevValue.IsNone then
                Some(currentValue, (Some(currentValue), Seq.skip 1 restOfValues  )) // Only happens to the first item in the seq
            else
                let currentTime = fst currentValue
                let prevTime = fst prevValue.Value
                let timeDiffBetweenPrevAndCurrentValue = currentTime.Subtract(prevTime)
                if timeDiffBetweenPrevAndCurrentValue = sizeOfHolesToPatch then
                    let dummyValue = (prevTime.Add timeBetweenContiguousValues, 42.0) // 42 is chosen here for obvious reasons, making this comment superfluous
                    Some(dummyValue, (Some(dummyValue), restOfValues))
                else
                    Some(currentValue, (Some(currentValue), Seq.skip 1 restOfValues))) // Either the two values were contiguous, or the gap between them was too large to patch

// Method 2 (Fast)
let insertDummyValuesWhereASingleValueIsMissing2 (timeBetweenContiguousValues : TimeSpan) (values : (DateTime * float) list) =
    let sizeOfHolesToPatch = timeBetweenContiguousValues.Add timeBetweenContiguousValues // Only insert dummy-values when the gap is twice the normal
    (None, values) |> Seq.unfold (fun (prevValue, restOfValues) ->  
        match restOfValues with
        | [] -> None // Reached the end of the input list
        | currentValue::restOfValues -> 
            if prevValue.IsNone then
                Some(currentValue, (Some(currentValue), restOfValues  )) // Only happens to the first item in the list
            else
                let currentTime = fst currentValue
                let prevTime = fst prevValue.Value
                let timeDiffBetweenPrevAndCurrentValue = currentTime.Subtract(prevTime)
                if timeDiffBetweenPrevAndCurrentValue = sizeOfHolesToPatch then
                    let dummyValue = (prevTime.Add timeBetweenContiguousValues, 42.0) 
                    Some(dummyValue, (Some(dummyValue), currentValue::restOfValues))
                else
                    Some(currentValue, (Some(currentValue), restOfValues))) // Either the two values were contiguous, or the gap between them was too large to patch

// Test data
let numbers = {1.0..10000.0}
let contiguousTimeStamps = seq { for n in numbers -> DateTime.Now.AddMinutes(n)}

let dataWithOccationalHoles = Seq.zip contiguousTimeStamps numbers |> Seq.filter (fun (dateTime, num) -> num % 77.0 <> 0.0) // Has a gap in the data every 77 items

let timeBetweenContiguousValues = (new TimeSpan(0,1,0))

// The fast sequence-patching (method 2)
dataWithOccationalHoles |> List.of_seq |> insertDummyValuesWhereASingleValueIsMissing2 timeBetweenContiguousValues |> Seq.iter (fun pair -> printfn "%f %s" (snd pair) ((fst pair).ToString()))

// The SLOOOOOOW sequence-patching (method 1)
dataWithOccationalHoles |> insertDummyValuesWhereASingleValueIsMissing1 timeBetweenContiguousValues |> Seq.iter (fun pair -> printfn "%f %s" (snd pair) ((fst pair).ToString()))
like image 616
Treefrog Avatar asked Aug 20 '09 13:08

Treefrog


2 Answers

Any time you break apart a seq using Seq.hd and Seq.skip 1 you are almost surely falling into the trap of going O(N^2). IEnumerable<T> is an awful type for recursive algorithms (including e.g. Seq.unfold), since these algorithms almost always have the structure of 'first element' and 'remainder of elements', and there is no efficient way to create a new IEnumerable that represents the 'remainder of elements'. (IEnumerator<T> is workable, but its API programming model is not so fun/easy to work with.)

If you need the original data to 'stay lazy', then you should use a LazyList (in the F# PowerPack). If you don't need the laziness, then you should use a concrete data type like 'list', which you can 'tail' into in O(1).

(You should also check out Avoiding stack overflow (with F# infinite sequences of sequences) as an FYI, though it's only tangentially applicable to this problem.)

like image 68
Brian Avatar answered Dec 29 '22 18:12

Brian


Seq.skip constructs a new sequence. I think that is why your original approach is slow.

My first inclination is to use a sequence expression and Seq.pairwise. This is fast and easy to read.

let insertDummyValuesWhereASingleValueIsMissingSeq (timeBetweenContiguousValues : TimeSpan) (values : seq<(DateTime * float)>) =
  let sizeOfHolesToPatch = timeBetweenContiguousValues.Add timeBetweenContiguousValues // Only insert dummy-values when the gap is twice the normal
  seq {
    yield Seq.hd values
    for ((prevTime, _), ((currentTime, _) as next)) in Seq.pairwise values do
      let timeDiffBetweenPrevAndCurrentValue = currentTime.Subtract(prevTime)
      if timeDiffBetweenPrevAndCurrentValue = sizeOfHolesToPatch then
        let dummyValue = (prevTime.Add timeBetweenContiguousValues, 42.0) // 42 is chosen here for obvious reasons, making this comment superfluous
        yield dummyValue
      yield next
  }
like image 33
Jason Avatar answered Dec 29 '22 19:12

Jason