How to fold a Scala iterator and get a lazily evaluated sequence as result?

Tags:

scala

I have an iterator of strings, where each string is either "H" (header) or "D" (detail). I want to split this iterator into blocks, where each block starts with one header and is followed by zero or more details.

I know how to solve this problem by loading everything into memory. For example, the code below:

Seq("H","D","D","D","H","D","H","H","D","D","H","D").toIterator
  .foldLeft(List[List[String]]())((acc, x) => x match {
    case "H" => List(x) :: acc
    case "D" => (x :: acc.head) :: acc.tail })
  .map(_.reverse)
  .reverse

returns 5 blocks - List(List(H, D, D, D), List(H, D), List(H), List(H, D, D), List(H, D)) - which is what I want.

However, instead of List[List[String]] in the result, I want either Iterator[List[String]] or some other structure that lets me evaluate the result lazily, so that the entire input is not loaded into memory even when the entire iterator is consumed. I want to load into memory only the block being consumed at a time (e.g. when I call iterator.next).

How can I modify the code above to achieve the result I want?

EDIT: I need this in Scala 2.11 specifically, as the environment I use sticks to it. Glad to also accept answers for other versions though.

514

asked Feb 11 '20 17:02

mvallebr

3 Answers

If you're using Scala 2.13.x then you might create a new Iterator by unfolding over the original Iterator.

import scala.collection.mutable.ListBuffer

val data = Seq("H","D","D","D","H","D","H","H","D","D","H","D").iterator

val rslt = Iterator.unfold(data.buffered){itr =>
  Option.when(itr.hasNext) {
    val lb = ListBuffer(itr.next())
    while (itr.hasNext && itr.head == "D")
      lb += itr.next()
    (lb.toList, itr)
  }
}

testing:

rslt.next()   //res0: List[String] = List(H, D, D, D)
rslt.next()   //res1: List[String] = List(H, D)
rslt.next()   //res2: List[String] = List(H)
rslt.next()   //res3: List[String] = List(H, D, D)
rslt.next()   //res4: List[String] = List(H, D)
rslt.hasNext  //res5: Boolean = false
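For Scala 2.11, where Iterator.unfold is not available, the same buffered look-ahead idea can be written as a hand-rolled Iterator. The following is only a minimal sketch and not part of the answer above: the names BlockIterator, blocks and isHeader are made up for illustration, and it assumes that every element for which isHeader is false belongs to the current block.

```scala
import scala.collection.mutable.ListBuffer

object BlockIterator {
  /** Hypothetical helper: lazily splits `source` into blocks.
   *  A block is the next element plus all following elements
   *  for which `isHeader` is false.
   */
  def blocks[T](source: Iterator[T])(isHeader: T => Boolean): Iterator[List[T]] =
    new Iterator[List[T]] {
      // `buffered` lets us peek at the next element without consuming it.
      private val in = source.buffered

      def hasNext: Boolean = in.hasNext

      def next(): List[T] = {
        // The first element always opens the block, header or not.
        val block = ListBuffer(in.next())
        while (in.hasNext && !isHeader(in.head))
          block += in.next()
        block.toList
      }
    }
}
```

Used like this:

```scala
val it = BlockIterator.blocks(Iterator("H", "D", "D", "D", "H", "D", "H"))(_ == "H")
it.next()   // List(H, D, D, D)
```

Each call to next() holds only the current block in memory. To detect the end of a block it reads one element past it (the next header), which stays in the buffer until the following call.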

106

answered Oct 21 '22 11:10

jwvh

Here is the simplest implementation I could find (It's generic and lazy):

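/** Takes 'it' and groups consecutive elements
 *  until the next item that satisfies the 'startGroup' predicate occurs.
 *  It returns Iterator[List[T]] and is lazy
 *  (keeps in memory only the last group, not the whole 'it').
 */
def groupUsing[T](it: Iterator[T])(startGroup: T => Boolean): Iterator[List[T]] = {
  // Running group, newest element first; restarts whenever a group starts.
  val sc = it.scanLeft(List.empty[T]) {
    (a, b) => if (startGroup(b)) b :: Nil else b :: a
  }

  // A group is complete when the next accumulator did not grow.
  // The trailing Nil closes the last group; `a.nonEmpty` skips the initial
  // empty accumulator, so an empty input yields no groups at all.
  (sc ++ Iterator(Nil)).sliding(2, 1).collect {
    case Seq(a, b) if a.nonEmpty && a.length >= b.length => a.reverse
  }
}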

Use it like this:

val exampleIt = Seq("H1","D1","D2","D3","H2","D4","H3","H4","D5","D6","H5","D7").toIterator
groupUsing(exampleIt)(_.startsWith("H"))
// H1 D1 D2 D3 / H2 D4 / H3 / H4 D5 D6 / H5 D7

Here is the specification:

X | GIVEN            | EXPECTED     |
O |                  |              | empty iterator
O | H                | H            | single header
O | D                | D            | single item (not header)
O | HD               | HD           |
O | HH               | H,H          | only headers
O | HHD              | H,HD         |
O | HDDDHD           | HDDD,HD      |
O | DDH              | DD,H         | leading D's have no header, as you can see.
O | HDDDHDHDD        | HDDD,HD,HDD  |
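The rows of the table can be turned into executable checks. The sketch below is an illustration under my own assumptions, not the tests from the linked fiddle: SpecCheck, render and the one-character-per-element encoding of the inputs are hypothetical, and the copy of groupUsing included here carries an a.nonEmpty guard so that the empty-input row yields no groups rather than a single empty one.

```scala
object SpecCheck {
  // Copy of groupUsing so that this snippet is self-contained.
  def groupUsing[T](it: Iterator[T])(startGroup: T => Boolean): Iterator[List[T]] = {
    val sc = it.scanLeft(List.empty[T]) {
      (a, b) => if (startGroup(b)) b :: Nil else b :: a
    }
    (sc ++ Iterator(Nil)).sliding(2, 1).collect {
      case Seq(a, b) if a.nonEmpty && a.length >= b.length => a.reverse
    }
  }

  /** Hypothetical helper: groups the characters of `input`
   *  (treating "H" as a group start) and joins groups with commas.
   */
  def render(input: String): String =
    groupUsing(input.iterator.map(_.toString))(_ == "H")
      .map(_.mkString)
      .mkString(",")

  // GIVEN -> EXPECTED, one entry per table row.
  val spec: List[(String, String)] = List(
    ""          -> "",
    "H"         -> "H",
    "D"         -> "D",
    "HD"        -> "HD",
    "HH"        -> "H,H",
    "HHD"       -> "H,HD",
    "HDDDHD"    -> "HDDD,HD",
    "DDH"       -> "DD,H",
    "HDDDHDHDD" -> "HDDD,HD,HDD"
  )

  /** Rows whose rendered result differs from the expected column. */
  def failures: List[(String, String)] =
    spec.filter { case (input, expected) => render(input) != expected }
}
```

SpecCheck.failures returns the rows that do not match, so an empty list means every row of the table holds.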

scalafiddle with tests and additional comments: https://scalafiddle.io/sf/q8xbQ9N/11

(If this answer is helpful, please up-vote it. I spent a little too much time on it :))

SECOND IMPLEMENTATION:

You proposed a version that does not use sliding. Here it is, but it has its own problems, listed below.

def groupUsing2[T >: Null](it:Iterator[T])(startGroup:T => Boolean):Iterator[List[T]] = {
  type TT = (List[T], List[T])
  val empty:TT = (Nil, Nil)
  //We need this ugly `++ Iterator(null)` to close last group.
  val sc = (it ++ Iterator(null)).scanLeft(empty) {
    (a,b) => if (b == null || startGroup(b)) (b::Nil, a._1) else (b::a._1, Nil)
  }

  sc.collect { 
    case (_, a) if a.nonEmpty => a.reverse
  }
}

Traits:

(-) It works only for T >: Null types. We need to append an element that closes the last group at the end (null is perfect for that, but it limits our type).
(~) It should create the same amount of garbage as the previous version. We just create tuples in the first step instead of the second one.
(+) It does not check the length of the List (and this is a big gain, to be honest).
(+) At its core it is Ivan Kurchenko's answer, but without the extra boxing.
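The T >: Null restriction can be traded for some boxing by marking the end of the input with None instead of null. This is a sketch of that trade-off under my own naming (GroupUsingOption and groupUsing3 are hypothetical), not a drop-in from either answer. It wraps every element in Some, which is exactly the extra allocation the null version avoids.

```scala
object GroupUsingOption {
  /** Hypothetical variant of groupUsing2 that works for any T:
   *  the end of the input is signalled by None rather than null.
   */
  def groupUsing3[T](it: Iterator[T])(startGroup: T => Boolean): Iterator[List[T]] = {
    // Wrap each element and append a terminator that closes the last group.
    val terminated: Iterator[Option[T]] =
      it.map[Option[T]](Some(_)) ++ Iterator[Option[T]](None)

    // State: (group in progress, group that has just been completed).
    val init: (List[T], List[T]) = (Nil, Nil)

    terminated
      .scanLeft(init) {
        case ((current, _), Some(x)) if startGroup(x) => (x :: Nil, current)
        case ((current, _), Some(x))                  => (x :: current, Nil)
        case ((current, _), None)                     => (Nil, current)
      }
      .collect { case (_, done) if done.nonEmpty => done.reverse }
  }
}
```

As in groupUsing2, no list lengths are compared; a group is emitted at the step where the next group starts or the terminator arrives.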

Here is scalafiddle: https://scalafiddle.io/sf/q8xbQ9N/11

answered Oct 21 '22 11:10

Scalway

I think the scanLeft operation might help in this case if you would like to use Scala 2.11.

I would like to propose the following solution, but I'm afraid it looks more complicated than the original one:

def main(args: Array[String]): Unit = {
    sealed trait SequenceItem
    case class SequenceSymbol(value: String) extends SequenceItem
    case object Termination extends SequenceItem

    /**
      * _1 - HD sequence in progress
      * _2 - HD sequences which is ready
      */
    type ScanResult = (List[String], List[String])
    val init: ScanResult = Nil -> Nil

    val originalIterator: Iterator[SequenceItem] = Seq("H","D","D","D", "H","D", "H", "H","D","D", "H","D")
      .toIterator.map(SequenceSymbol)

    val iteratorWithTermination: Iterator[SequenceItem] = originalIterator ++ Seq(Termination).toIterator
    val result: Iterator[List[String]] = iteratorWithTermination
      .scanLeft(init) {
        case ((progress, _), SequenceSymbol("H")) =>  List("H") -> progress
        case ((progress, _), SequenceSymbol("D")) => ("D" :: progress) -> Nil
        case ((progress, _), Termination) => Nil -> progress
      }
      .collect {
        case (_, ready) if ready.nonEmpty => ready
      }
      .map(_.reverse)

    println(result.mkString(", "))
  }

Types are added for readability of the example. Hope this helps!

answered Oct 21 '22 12:10

Ivan Kurchenko
