Scala for loop and iterators

Tags:

Let's assume I have a very large iterable collection of values (in the order of 100,000 String entries, read from disk one by one), and I do something on its cartesian product (and write the result back to disk, though I won't show that here):

for(v1 <- values; v2 <- values) yield ((v1, v2), 1)

I understand that this is just another way of writing

values.flatMap(v1 => values.map(v2 => ((v1, v2), 1)))

This apparently causes the entire collection for each flatMap iteration (or even the entire cartesian product?) to be kept in memory. If you read the first version using the for loop this obviously is unnecessary. Ideally only two entries (the ones being combined) should be kept in memory at all times.

If I reformulate the first version like this:

for(v1 <- values.iterator; v2 <- values.iterator) yield ((v1, v2), 1)

memory consumption is a lot lower, leading me to assume that this version must be fundamentally different. What exactly does it do differently in the second version? Why does Scala not implicitly use iterators for the first version? Is there any speedup when not using iterators in some circumstances?

Thanks! (And also thanks to "lmm" who answered an earlier version of this question)

651

asked Dec 10 '14 15:12

Johannes

2 Answers

In Scala, yield does not produce a lazy sequence. My understanding is that you get all of the values at once so that you can index them all as a collection. For example, I had written the following for a ray tracer to generate rays:

def viewRays(aa:ViewHelper.AntiAliasGenerator) =
{
  for (y <- 0 until height; x <- 0 until width)
    yield (x, y, aa((x, y)))
}

which failed spectacularly (out of memory) because it made all the rays up front (surprise!). By using the .iterator method, you're specifically asking for a lazy iterator. The above example can be modified to this:

def viewRays(aa:ViewHelper.AntiAliasGenerator) =
{
  for (y <- 0 until height iterator; x <- 0 until width iterator)
    yield (x, y, aa((x, y)))
}

which works in a lazy manner.

answered Oct 09 '22 09:10

plinth

The first version is strictly evaluated; it creates a real, concrete collection with all those values in. The second "just" provides an Iterator, which lets you iterate over all the values; they'll be created as you actually perform the iteration.

The main reason Scala defaults to the first is because scala as a language allows side effects. If you write your two mappings as:

for(v1 <- values; v2 <- values) yield {println("hello"); ((v1, v2), 1)}
for(v1 <- values.iterator; v2 <- values.iterator) yield {
  println("hello"); ((v1, v2), 1)}

then what happens with the second may surprised you, especially in a larger application where the iterator might be created a long way away from where it's actually used.

A collection will perform better than an iterator if the map operation itself is expensive and you create it once and reuse it multiple times - the iterator has to recalculate the values each time, whereas the collection exists in memory. Arguably this makes the collection performance more predictable - it consumes a lot of memory, but it's the same amount whatever the collection is then used for.

If you want a collections library that is more willing to elide operations and optimize - perhaps because you already write all your code to be without side effects - you might want to consider Paul Philips' new effort.

answered Oct 09 '22 09:10

lmm

Related questions
                            
                                Right arrow in class body in Scala
                            
                                Akka remote actor server discovery
                            
                                Import multiple packages in Scala REPL
                            
                                Setting different spaces for tab for Java and Scala in Eclipse
                            
                                How to access a Java static method from Scala given a type alias for that class it resides in
                            
                                What's the relation of fold on Option, Either etc and fold on Traversable?
                            
                                Basic type inferral
                            
                                curried function related questions in Scala
                            
                                Assigning scala empty array
                            
                                Unmodifiable view of a mutable Scala collection
                            
                                How to handle optional fields in JSON parsing in play 2.1
                            
                                Good, idiomatic way to refactor out business logic from controllers
                            
                                Scala copy case class with generic type
                            
                                Nested implicit macros seems not working for type parameters?
                            
                                How to create an instance of type T at runtime with TypeTags
                            
                                How can I use JMH for Scala benchmarks together with sbt?
                            
                                Wait for an unknown number of futures
                            
                                How to run tests using a custom WebApplicationInitializer?
                            
                                Are the head and tail of a Set guaranteed to be mutually exclusive?
                            
                                Shapeless: generic lens parameterized by case class or field

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Scala for loop and iterators

Tags:

iterator

loops

scala

Johannes

People also ask

2 Answers

plinth

lmm

Recent Activity

Donate For Us