Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Scala for loop and iterators

Let's assume I have a very large iterable collection of values (in the order of 100,000 String entries, read from disk one by one), and I do something on its cartesian product (and write the result back to disk, though I won't show that here):

for(v1 <- values; v2 <- values) yield ((v1, v2), 1)

I understand that this is just another way of writing

values.flatMap(v1 => values.map(v2 => ((v1, v2), 1)))

This apparently causes the entire collection for each flatMap iteration (or even the entire cartesian product?) to be kept in memory. If you read the first version using the for loop this obviously is unnecessary. Ideally only two entries (the ones being combined) should be kept in memory at all times.

If I reformulate the first version like this:

for(v1 <- values.iterator; v2 <- values.iterator) yield ((v1, v2), 1)

memory consumption is a lot lower, leading me to assume that this version must be fundamentally different. What exactly does it do differently in the second version? Why does Scala not implicitly use iterators for the first version? Is there any speedup when not using iterators in some circumstances?

Thanks! (And also thanks to "lmm" who answered an earlier version of this question)

like image 651
Johannes Avatar asked Dec 10 '14 15:12

Johannes


People also ask

Can we use for loop in Scala?

In Scala, for loop is also known as for-comprehensions. A for loop is a repetition control structure which allows us to write a loop that is executed a specific number of times. The loop enables us to perform n number of steps together in one line.

What is Scala iterator?

Delta Lake with Apache Spark using Scala An iterator is not a collection, but rather a way to access the elements of a collection one by one. The two basic operations on an iterator it are next and hasNext. A call to it. next() will return the next element of the iterator and advance the state of the iterator.

Are iterators lazy in Scala?

Unlike operations directly on a concrete collection like List , operations on Iterator are lazy. A lazy operation does not immediately compute all of its results.

What is one way to iterate Scala?

Two important methods in Scala Iterator are next() and hasNext. hasNext will tell if there is another element to return, while next() will return that element.


2 Answers

In Scala, yield does not produce a lazy sequence. My understanding is that you get all of the values at once so that you can index them all as a collection. For example, I had written the following for a ray tracer to generate rays:

def viewRays(aa:ViewHelper.AntiAliasGenerator) =
{
  for (y <- 0 until height; x <- 0 until width)
    yield (x, y, aa((x, y)))
}

which failed spectacularly (out of memory) because it made all the rays up front (surprise!). By using the .iterator method, you're specifically asking for a lazy iterator. The above example can be modified to this:

def viewRays(aa:ViewHelper.AntiAliasGenerator) =
{
  for (y <- 0 until height iterator; x <- 0 until width iterator)
    yield (x, y, aa((x, y)))
}

which works in a lazy manner.

like image 67
plinth Avatar answered Oct 09 '22 09:10

plinth


The first version is strictly evaluated; it creates a real, concrete collection with all those values in. The second "just" provides an Iterator, which lets you iterate over all the values; they'll be created as you actually perform the iteration.

The main reason Scala defaults to the first is because scala as a language allows side effects. If you write your two mappings as:

for(v1 <- values; v2 <- values) yield {println("hello"); ((v1, v2), 1)}
for(v1 <- values.iterator; v2 <- values.iterator) yield {
  println("hello"); ((v1, v2), 1)}

then what happens with the second may surprised you, especially in a larger application where the iterator might be created a long way away from where it's actually used.

A collection will perform better than an iterator if the map operation itself is expensive and you create it once and reuse it multiple times - the iterator has to recalculate the values each time, whereas the collection exists in memory. Arguably this makes the collection performance more predictable - it consumes a lot of memory, but it's the same amount whatever the collection is then used for.

If you want a collections library that is more willing to elide operations and optimize - perhaps because you already write all your code to be without side effects - you might want to consider Paul Philips' new effort.

like image 38
lmm Avatar answered Oct 09 '22 09:10

lmm