 

How to merge iterator parser

Tags:

iterator

scala

I have an Iterator[(A1,B1)] and two functions

  • fA: (Iterator[A1]) => Iterator[A2] and
  • fB: (Iterator[B1]) => Iterator[B2].

Is it possible to make a fAB: (Iterator[(A1,B1)]) => Iterator[(A2,B2)] without converting Iterators to Seq?

Edit

Both answers below are good. I selected @Aivean's answer because the code is simpler and it uses a specialized Scala data structure (Stream).

The only drawback is the stack-overflow limitation, but that shouldn't be a problem for most use cases. If your iterator can be very (very) long, then @Alexey's solution should be preferred.

asked Apr 14 '16 by Juh_



2 Answers

No. Your hypothetical function has to call one of fA and fB first. Suppose it calls fA, and fA requests all the A1s before producing anything. Then you have no B1s left to pass to fB, unless you save them somewhere, potentially leaking memory. If that's acceptable, you can do:

import scala.collection.immutable.Queue

def unzip[A, B](iter: Iterator[(A, B)]) = {
  var qA = Queue.empty[A]
  var qB = Queue.empty[B]

  val iterA = new Iterator[A] {
    override def hasNext = qA.nonEmpty || iter.hasNext

    override def next() = qA.dequeueOption match {
      case Some((a, qA1)) =>
        qA = qA1
        a
      case None =>
        val (a, b) = iter.next()
        qB = qB.enqueue(b)
        a
    }
  }

  // iterB is symmetric: it drains qB first, and while pulling from iter
  // it buffers the A side into qA
  val iterB = new Iterator[B] {
    override def hasNext = qB.nonEmpty || iter.hasNext

    override def next() = qB.dequeueOption match {
      case Some((b, qB1)) =>
        qB = qB1
        b
      case None =>
        val (a, b) = iter.next()
        qA = qA.enqueue(a)
        b
    }
  }

  (iterA, iterB)
}

and then

val (iterA, iterB) = unzip(iterator)
fA(iterA).zip(fB(iterB))

(Well, you can also write iterator => fA(iterator.map(_._1)).zip(fB(iterator.map(_._2))): it has the right type, but it is probably not what you want. It reads from the same iterator on both sides, so it will use only one field of each tuple produced by the original iterator and drop the other.)
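To see concretely what that single-pass version does, here is a small sketch (the sample data is made up for illustration):

```scala
// Naive single-pass version: both map views read from the SAME underlying
// iterator, so zip alternates consumption between them.
val naive: Iterator[(Int, Int)] => Iterator[(Int, Int)] =
  it => it.map(_._1).zip(it.map(_._2))

val result = naive(Iterator((1, 10), (2, 20), (3, 30), (4, 40))).toList
// Each zip step consumes two tuples (_1 of one, _2 of the next),
// so half of every tuple is silently dropped.
println(result) // prints List((1,20), (3,40))
```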

answered Sep 21 '22 by Alexey Romanov


I came up with a much simpler implementation:

def iterUnzip[A1, B1, A2, B2](it: Iterator[(A1, B1)],
                              fA: (Iterator[A1]) => Iterator[A2],
                              fB: (Iterator[B1]) => Iterator[B2]) = {
  val s = it.toStream
  fA(s.map(_._1).toIterator).zip(fB(s.map(_._2).toIterator))
}

The idea is to convert the iterator to a stream. Stream in Scala is lazy but also memoizes. This effectively provides the same buffering mechanism as in @AlexeyRomanov's solution, but is more concise. The only drawback is that Stream keeps its memoized elements on the stack, as opposed to the explicit Queue, so if fA and fB produce elements at uneven rates you may get a StackOverflowError.
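As a side note, `Stream` was deprecated in Scala 2.13 in favor of `LazyList`. A hypothetical adaptation of the same idea (not part of the original answer, and using a different name to avoid clashing with the definition above) could look like this:

```scala
// Sketch for Scala 2.13+: LazyList memoizes evaluated elements, so both
// projections share a single traversal of the underlying iterator.
def iterUnzipLL[A1, B1, A2, B2](it: Iterator[(A1, B1)],
                                fA: Iterator[A1] => Iterator[A2],
                                fB: Iterator[B1] => Iterator[B2]): Iterator[(A2, B2)] = {
  val s = LazyList.from(it)
  fA(s.map(_._1).iterator).zip(fB(s.map(_._2).iterator))
}

val out = iterUnzipLL(Iterator((1, 10), (2, 20)),
                      (_: Iterator[Int]).map(_ * 2),
                      (_: Iterator[Int]).map(_ + 1)).toList
println(out) // prints List((2,11), (4,21))
```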

A test that evaluation is indeed lazy:

val iter = Stream.from(0).map(x => (x, x + 1))
  .map(x => {println("fetched: " + x); x}).take(5).toIterator

iterUnzip(
  iter,
  (_:Iterator[Int]).flatMap(x => List(x, x)),
  (_:Iterator[Int]).map(_ + 1)
).toList

Result:

fetched: (0,1)
iter: Iterator[(Int, Int)] = non-empty iterator

fetched: (1,2)
fetched: (2,3)
fetched: (3,4)
fetched: (4,5)
res0: List[(Int, Int)] = List((0,2), (0,3), (1,4), (1,5), (2,6))

I also tried reasonably hard to trigger a StackOverflowError by producing uneven iterators, but failed.

val iter = Stream.from(0).map(x => (x, x + 1)).take(10000000).toIterator
iterUnzip(
    iter,
    (_:Iterator[Int]).flatMap(x => List.fill(1000000)(x)),
    (_:Iterator[Int]).map(_ + 1)
  ).size

Works fine with -Xss5m and produces:

res10: Int = 10000000

So, overall this is a reasonably good and concise solution, unless you have some extreme use cases.

answered Sep 21 '22 by Aivean