Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Split an iterator by a predicate

Tags:

iterator

scala

I need a method that can split Iterator[Char] into lines (separated by \n and \r)

For that, I wrote a general method that gets an iterator and a predicate and will split the iterator every time the predicate is true. This is similar to span, but will split every time the predicate is true, not only the first time

this is my implementation:

def iterativeSplit[T](iterO: Iterator[T])(breakOn: T => Boolean): Iterator[List[T]] = 
 new Iterator[List[T]] {
  private var iter = iterO
  def hasNext = iter.hasNext

  def next = {
    val (i1,i2) = iter.span(el => !breakOn(el))
    val cur = i1.toList
    iter = i2.dropWhile(breakOn)
    cur
  }
}.withFilter(l => l.nonEmpty)

and it works well on small inputs, but on larges inputs, this runs very slow, and sometimes I get stack overflow exception

here is the code that recreates the issue:

val iter = ("aaaaaaaaabbbbbbbbbbbccccccccccccc\r\n" * 10000).iterator
iterativeSplit(iter)(c => c == '\r' || c == '\n').length

the stack trace during the run is:

... 
at scala.collection.Iterator$$anon$1.hasNext(Iterator.scala:847)
at scala.collection.Iterator$$anon$19.hasNext(Iterator.scala:615)
at scala.collection.Iterator$$anon$1.hasNext(Iterator.scala:847)
at scala.collection.Iterator$$anon$18.hasNext(Iterator.scala:591)
at scala.collection.Iterator$$anon$1.hasNext(Iterator.scala:847)
at scala.collection.Iterator$$anon$19.hasNext(Iterator.scala:615)
at scala.collection.Iterator$$anon$1.hasNext(Iterator.scala:847)
at scala.collection.Iterator$$anon$18.hasNext(Iterator.scala:591)
at scala.collection.Iterator$$anon$1.hasNext(Iterator.scala:847)
at scala.collection.Iterator$$anon$19.hasNext(Iterator.scala:615)
at scala.collection.Iterator$$anon$1.hasNext(Iterator.scala:847)
at scala.collection.Iterator$$anon$18.hasNext(Iterator.scala:591)
at scala.collection.Iterator$$anon$1.hasNext(Iterator.scala:847)
...

looking at the source code (I'm using scala 2.10.4) line 591 is the hasNext of the second iterator from the span, and line 651 is the hasNext in the iterator from dropWhile

I guess I'm using those 2 iterators incorrectly, but I can't see why

like image 346
lev Avatar asked Jul 05 '15 09:07

lev


1 Answers

You can simplify your code as follows, which seems to solve the problem:

  def iterativeSplit2[T](iter: Iterator[T])(breakOn: T => Boolean): Iterator[List[T]] =
    new Iterator[List[T]] {
      def hasNext = iter.hasNext

      def next = {
        val cur = iter.takeWhile(!breakOn(_)).toList
        iter.dropWhile(breakOn)
        cur
      }
    }.withFilter(l => l.nonEmpty)   

Rather than using span (so you need to replace iter on each call to next), simply use takeWhile and dropWhile on the original iter. Then there's no need for the var.

I think the cause of your original stack overflow is that repeatedly calling span creates a long chain of Iterators, each of whose hasNext methods calls the hasNext of its parent Iterator. If you look at the source code for Iterator, you can see that each span creates new Iterators that forward calls to hasNext to the original iterator (via a BufferedIterator, which increases the call stack even further).

Update having consulted the documentation it seems that, although my solution above appears to work, it is not recommended - see particularly:

It is of particular importance to note that, unless stated otherwise, one should never use an iterator after calling a method on it. [...] Using the old iterator is undefined, subject to change, and may result in changes to the new iterator as well.

which applies to takeWhile and dropWhile (and span), but not next or hasNext.

It's possible to use span as in your original solution, but using streams rather than iterators, and recursion:

  def split3[T](s: Stream[T])(breakOn: T => Boolean): Stream[List[T]] = s match {
    case Stream.Empty => Stream.empty
    case s => {
      val (a, b) = s.span(!breakOn(_))
      a.toList #:: split3(b.dropWhile(breakOn))(breakOn)
    }
  }

But the performance is pretty terrible. I'm sure there must be a better way...

Update 2: Here is a very imperative solution that has better performance:

import scala.collection.mutable.ListBuffer

  def iterativeSplit4[T](iter: Iterator[T])(breakOn: T => Boolean): Iterator[List[T]] =
    new Iterator[List[T]] {
      val word = new ListBuffer[T]
      def hasNext() = iter.hasNext

      def next = {
        var looking = true
        while (looking) {
          val c = iter.next
          if (breakOn(c)) looking = false
          else word += c
        }
        val w = word.toList
        word.clear()
        w
      }
    }.withFilter(_.nonEmpty)
like image 52
DNA Avatar answered Nov 05 '22 09:11

DNA