I need a method that can split Iterator[Char] into lines (separated by \n and \r)
For that, I wrote a general method that gets an iterator and a predicate and will split the iterator every time the predicate is true.
This is similar to span, but will split every time the predicate is true, not only the first time
this is my implementation:
def iterativeSplit[T](iterO: Iterator[T])(breakOn: T => Boolean): Iterator[List[T]] =
new Iterator[List[T]] {
private var iter = iterO
def hasNext = iter.hasNext
def next = {
val (i1,i2) = iter.span(el => !breakOn(el))
val cur = i1.toList
iter = i2.dropWhile(breakOn)
cur
}
}.withFilter(l => l.nonEmpty)
and it works well on small inputs, but on larges inputs, this runs very slow, and sometimes I get stack overflow exception
here is the code that recreates the issue:
val iter = ("aaaaaaaaabbbbbbbbbbbccccccccccccc\r\n" * 10000).iterator
iterativeSplit(iter)(c => c == '\r' || c == '\n').length
the stack trace during the run is:
...
at scala.collection.Iterator$$anon$1.hasNext(Iterator.scala:847)
at scala.collection.Iterator$$anon$19.hasNext(Iterator.scala:615)
at scala.collection.Iterator$$anon$1.hasNext(Iterator.scala:847)
at scala.collection.Iterator$$anon$18.hasNext(Iterator.scala:591)
at scala.collection.Iterator$$anon$1.hasNext(Iterator.scala:847)
at scala.collection.Iterator$$anon$19.hasNext(Iterator.scala:615)
at scala.collection.Iterator$$anon$1.hasNext(Iterator.scala:847)
at scala.collection.Iterator$$anon$18.hasNext(Iterator.scala:591)
at scala.collection.Iterator$$anon$1.hasNext(Iterator.scala:847)
at scala.collection.Iterator$$anon$19.hasNext(Iterator.scala:615)
at scala.collection.Iterator$$anon$1.hasNext(Iterator.scala:847)
at scala.collection.Iterator$$anon$18.hasNext(Iterator.scala:591)
at scala.collection.Iterator$$anon$1.hasNext(Iterator.scala:847)
...
looking at the source code (I'm using scala 2.10.4)
line 591 is the hasNext of the second iterator from the span, and line 651 is the hasNext in the iterator from dropWhile
I guess I'm using those 2 iterators incorrectly, but I can't see why
You can simplify your code as follows, which seems to solve the problem:
def iterativeSplit2[T](iter: Iterator[T])(breakOn: T => Boolean): Iterator[List[T]] =
new Iterator[List[T]] {
def hasNext = iter.hasNext
def next = {
val cur = iter.takeWhile(!breakOn(_)).toList
iter.dropWhile(breakOn)
cur
}
}.withFilter(l => l.nonEmpty)
Rather than using span (so you need to replace iter on each call to next), simply use takeWhile and dropWhile on the original iter. Then there's no need for the var.
I think the cause of your original stack overflow is that repeatedly calling span creates a long chain of Iterators, each of whose hasNext methods calls the hasNext of its parent Iterator. If you look at the source code for Iterator, you can see that each span creates new Iterators that forward calls to hasNext to the original iterator (via a BufferedIterator, which increases the call stack even further).
Update having consulted the documentation it seems that, although my solution above appears to work, it is not recommended - see particularly:
It is of particular importance to note that, unless stated otherwise, one should never use an iterator after calling a method on it. [...] Using the old iterator is undefined, subject to change, and may result in changes to the new iterator as well.
which applies to takeWhile and dropWhile (and span), but not next or hasNext.
It's possible to use span as in your original solution, but using streams rather than iterators, and recursion:
def split3[T](s: Stream[T])(breakOn: T => Boolean): Stream[List[T]] = s match {
case Stream.Empty => Stream.empty
case s => {
val (a, b) = s.span(!breakOn(_))
a.toList #:: split3(b.dropWhile(breakOn))(breakOn)
}
}
But the performance is pretty terrible. I'm sure there must be a better way...
Update 2: Here is a very imperative solution that has better performance:
import scala.collection.mutable.ListBuffer
def iterativeSplit4[T](iter: Iterator[T])(breakOn: T => Boolean): Iterator[List[T]] =
new Iterator[List[T]] {
val word = new ListBuffer[T]
def hasNext() = iter.hasNext
def next = {
var looking = true
while (looking) {
val c = iter.next
if (breakOn(c)) looking = false
else word += c
}
val w = word.toList
word.clear()
w
}
}.withFilter(_.nonEmpty)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With