Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why does filter(_:)’s predicate get called so many times when evaluating it lazily?

Tags:

swift

I saw an answer to this question, which, in it’s first revision, had code similar to this:

let numbers = Array(0 ..< 50)

let result = numbers.lazy
    .filter {
        // gets called 2-3x per element in the range (0...15)!
        print("Calling filter for: \($0)")
        return $0 % 3 == 0
    }
    .prefix(5)

print(Array(result)) // [0, 3, 6, 9, 12]

Which, through the use of a lazy filter collection, is able to filter the first 5 elements of numbers that satisfy a given predicate (in this case, being divisible by 3), without having to evaluate every element in the numbers array.

However, the answer then remarked that filter(_:)’s predicate could be called multiple times per element (3 times for elements in the range 1...15, and twice for 0 as it turns out).

What’s the reason for this inefficiency in the lazy evaluation of this filter? Is there any way to avoid the evaluation of the same element multiple times?

like image 653
Hamish Avatar asked Jan 30 '17 15:01

Hamish


1 Answers

The Problems

The first culprit here is the slicing of the lazy filter collection through the use of prefix(_:) – which, in this case, returns a BidirectionalSlice of the LazyFilterBidirectionalCollection.

In general, the slicing of a Collection entails the storing of the base collection, along with the range of indices that are valid for the slice to ‘view’. Therefore, in order to create a slice of a LazyFilterBidirectionalCollection to view the first 5 elements, the range of indices stored must be startIndex ..< indexAfterTheFifthElement.

In order to get indexAfterTheFifthElement, the LazyFilterBidirectionalCollection must iterate through the base collection (numbers) in order to find the 6th element that meets the predicate (you can see the exact implementation of the indexing here).

Therefore all the elements in the range 0...15 from the above example need to be checked against the predicate simply in order to create a slice of the lazy filter collection.

The second culprit is Array’s init(_:), which accepts a Sequence of elements of the same type as the array’s Element type. The implementation of this initialiser calls _copyToContiguousArray() on the sequence, which, for most sequences, forwards the call to this function:

internal func _copySequenceToContiguousArray<S : Sequence>
                                    (_ source: S) -> ContiguousArray<S.Iterator.Element> {

  let initialCapacity = source.underestimatedCount // <- problem here

  var builder =
    _UnsafePartiallyInitializedContiguousArrayBuffer<S.Iterator.Element>(
      initialCapacity: initialCapacity)

  var iterator = source.makeIterator()

  // FIXME(performance): use _copyContents(initializing:).
  // Add elements up to the initial capacity without checking for regrowth.
  for _ in 0..<initialCapacity {
    builder.addWithExistingCapacity(iterator.next()!)
  }

  // Add remaining elements, if any.
  while let element = iterator.next() {
    builder.add(element)
  }

  return builder.finish()
}

The problem here is underestimatedCount. For plain sequences, this just has a default implementation that returns 0 – however, for collections, it has a default implementation which gets the count of the collection (I go into this in more detail here).

The default implementation for the count of a Collection (which BidirectionalSlice will use here) is simply:

public var count: IndexDistance {
    return distance(from: startIndex, to: endIndex)
} 

Which, for our slice, will walk through the indices up to indexAfterTheFifthElement, thus re-evaluating the elements in the range 0...15, again.

Finally, an iterator of the slice is made, and is iterated through, adding each element in the sequence to the new array’s buffer. For a BidirectionalSlice, this will use an IndexingIterator, which simply works by advancing indices and outputting the element for each index.

The reason why this walk over the indices doesn’t re-evaluate the elements up to the first element of the result (note in the question’s example, 0 is evaluated one time less) is due to the fact that it doesn’t directly access the startIndex of the LazyFilterBidirectionalCollection, which has to evaluate all elements up to the first element in the result. Instead, the iterator can work from the start index of the slice itself.


The Solution

The simple solution is to avoid slicing the lazy filter collection in order to get its prefix, but instead apply the prefixing lazily.

There are actually two implementations of prefix(_:). One is provided by Collection, and returns a SubSequence (which is some form of slice for most standard library collections).

The other is provided by Sequence, that returns an AnySequence – which, under the hood uses a base sequence of _PrefixSequence, which simply takes an iterator and allows iteration through it until a given number of elements have been iterated over – then just stops returning elements.

For lazy collections, this implementation of prefix(_:) is great, as it doesn’t require any indexing – it just lazily applies the prefixing.

Therefore, if you say:

let result : AnySequence = numbers.lazy
    .filter {
        // gets called 1x per element :)
        print("Calling filter for: \($0)")
        return $0 % 3 == 0
    }
    .prefix(5)

The elements of numbers (up to the 5th match) will only be evaluated once by filter(_:)’s predicate when passed to Array's initialiser, as you’re forcing Swift to use Sequence’s default implementation of prefix(_:).

The foolproof way of preventing all indexing operations on a given lazy filter collection is to simply use a lazy filter sequence instead – this can be done by just wrapping the collection you wish to perform lazy operations on in an AnySequence:

let numbers = Array(0 ..< 50)

let result = AnySequence(numbers).lazy
    .filter {
        // gets called 1x per element :)
        print("Calling filter for: \($0)")
        return $0 % 3 == 0
    }
    .dropFirst(5) // neither of these will do indexing,
    .prefix(5)    // but instead return a lazily evaluated AnySequence.


print(Array(result)) // [15, 18, 21, 24, 27]

However note that for bidirectional collections, this may have an adverse effect for operations on the end of the collection – as then the entire sequence will have to be iterated through in order to reach the end.

For such operations as suffix(_:) and dropLast(_:), it may well be more efficient to work with a lazy collection over a sequence (at least for small inputs), as they can simply index from the end of the collection.

Although, as with all performance related concerns, you should first check to see if this is even a problem in the first place, and then secondly run your own tests to see which method is better for your implementation.


Conclusion (TL;DR)

So, after all of that – the lesson to learn here is that you should be wary of the fact that slicing a lazy filter collection can re-evaluate every element of the base collection up to the end index which the slice can ‘view’.

Often it’s more desirable to treat a lazy filter collection as a sequence instead, which cannot be indexed, therefore meaning that lazy operations cannot evaluate any elements (doing so would risk destructively iterating over them) until an eager operation happens.

However, you should then be wary of the fact that you’re potentially sacrificing being able to index the collection from the end, which is important for operations such as suffix(_:).

Finally, it’s worth noting that this isn't a problem with lazy views such as LazyMapCollection, as their elements don’t depend on the ‘results’ of previous elements – therefore they can be indexed in constant time if their base collection is a RandomAccessCollection.

like image 66
Hamish Avatar answered Nov 19 '22 03:11

Hamish