I need a method that can split Iterator[Char]
into lines (separated by \n
and \r
)
For that, I wrote a general method that gets an iterator and a predicate and will split the iterator every time the predicate is true.
This is similar to span
, but will split every time the predicate is true, not only the first time
this is my implementation:
def iterativeSplit[T](iterO: Iterator[T])(breakOn: T => Boolean): Iterator[List[T]] =
new Iterator[List[T]] {
private var iter = iterO
def hasNext = iter.hasNext
def next = {
val (i1,i2) = iter.span(el => !breakOn(el))
val cur = i1.toList
iter = i2.dropWhile(breakOn)
cur
}
}.withFilter(l => l.nonEmpty)
and it works well on small inputs, but on larges inputs, this runs very slow, and sometimes I get stack overflow exception
here is the code that recreates the issue:
val iter = ("aaaaaaaaabbbbbbbbbbbccccccccccccc\r\n" * 10000).iterator
iterativeSplit(iter)(c => c == '\r' || c == '\n').length
the stack trace during the run is:
...
at scala.collection.Iterator$$anon$1.hasNext(Iterator.scala:847)
at scala.collection.Iterator$$anon$19.hasNext(Iterator.scala:615)
at scala.collection.Iterator$$anon$1.hasNext(Iterator.scala:847)
at scala.collection.Iterator$$anon$18.hasNext(Iterator.scala:591)
at scala.collection.Iterator$$anon$1.hasNext(Iterator.scala:847)
at scala.collection.Iterator$$anon$19.hasNext(Iterator.scala:615)
at scala.collection.Iterator$$anon$1.hasNext(Iterator.scala:847)
at scala.collection.Iterator$$anon$18.hasNext(Iterator.scala:591)
at scala.collection.Iterator$$anon$1.hasNext(Iterator.scala:847)
at scala.collection.Iterator$$anon$19.hasNext(Iterator.scala:615)
at scala.collection.Iterator$$anon$1.hasNext(Iterator.scala:847)
at scala.collection.Iterator$$anon$18.hasNext(Iterator.scala:591)
at scala.collection.Iterator$$anon$1.hasNext(Iterator.scala:847)
...
looking at the source code (I'm using scala 2.10.4)
line 591 is the hasNext
of the second iterator from the span
, and line 651 is the hasNext
in the iterator from dropWhile
I guess I'm using those 2 iterators incorrectly, but I can't see why
You can simplify your code as follows, which seems to solve the problem:
def iterativeSplit2[T](iter: Iterator[T])(breakOn: T => Boolean): Iterator[List[T]] =
new Iterator[List[T]] {
def hasNext = iter.hasNext
def next = {
val cur = iter.takeWhile(!breakOn(_)).toList
iter.dropWhile(breakOn)
cur
}
}.withFilter(l => l.nonEmpty)
Rather than using span
(so you need to replace iter
on each call to next
), simply use takeWhile
and dropWhile
on the original iter
. Then there's no need for the var
.
I think the cause of your original stack overflow is that repeatedly calling span
creates a long chain of Iterator
s, each of whose hasNext
methods calls the hasNext
of its parent Iterator
. If you look at the source code for Iterator
, you can see that each span
creates new Iterators that forward calls to hasNext
to the original iterator (via a BufferedIterator
, which increases the call stack even further).
Update having consulted the documentation it seems that, although my solution above appears to work, it is not recommended - see particularly:
It is of particular importance to note that, unless stated otherwise, one should never use an iterator after calling a method on it. [...] Using the old iterator is undefined, subject to change, and may result in changes to the new iterator as well.
which applies to takeWhile
and dropWhile
(and span
), but not next
or hasNext
.
It's possible to use span
as in your original solution, but using streams rather than iterators, and recursion:
def split3[T](s: Stream[T])(breakOn: T => Boolean): Stream[List[T]] = s match {
case Stream.Empty => Stream.empty
case s => {
val (a, b) = s.span(!breakOn(_))
a.toList #:: split3(b.dropWhile(breakOn))(breakOn)
}
}
But the performance is pretty terrible. I'm sure there must be a better way...
Update 2: Here is a very imperative solution that has better performance:
import scala.collection.mutable.ListBuffer
def iterativeSplit4[T](iter: Iterator[T])(breakOn: T => Boolean): Iterator[List[T]] =
new Iterator[List[T]] {
val word = new ListBuffer[T]
def hasNext() = iter.hasNext
def next = {
var looking = true
while (looking) {
val c = iter.next
if (breakOn(c)) looking = false
else word += c
}
val w = word.toList
word.clear()
w
}
}.withFilter(_.nonEmpty)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With