
Is .parallelize(...) a lazy operation in Apache Spark?

Is parallelize (and other load operations) executed only at the time a Spark action is executed or immediately when it is encountered?

See def parallelize in the Spark source code.

Note that the consequences differ, for instance for .textFile(...): lazy evaluation would mean that, while it may save some memory initially, the text file has to be read every time an action is performed, and a change to the text file would affect all actions performed after the change.

Jonathan asked Mar 13 '23 18:03

1 Answer

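parallelize is executed lazily: see L726 of your cited code, which states "@note Parallelize acts lazily." The same note in the Scaladoc goes on to say that if the sequence passed in is a mutable collection and is altered after the call to parallelize but before the first action on the RDD, the resulting RDD reflects the modified collection.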

Execution in Spark is triggered only when you call an action, e.g. collect or count.

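In summary, with Spark:

  1. A chain of lazily evaluated operations is set up through the user API (by you), e.g. parallelize, map, filter, ...
  2. Once an action (e.g. collect, count, reduce) is called, the chain is turned into a job and executed. For RDDs such as the one created by parallelize, the DAG scheduler splits the lineage into stages and tasks; the Catalyst optimizer is only involved for DataFrame/Dataset (Spark SQL) queries.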
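
The two steps can be observed directly. The following is a minimal sketch, not taken from the Spark sources, assuming a Spark 2.x or later dependency on the classpath and a local master. The object name LazyEvaluationDemo, the helper run and the accumulator label are made up for illustration. It counts how often the function passed to map has run, once before and once after the first action.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical demo object; the names here are illustrative only.
object LazyEvaluationDemo {

  // Returns (accumulator value before the action,
  //          result of count(),
  //          accumulator value after the action).
  def run(sc: SparkContext): (Long, Long, Long) = {
    // Counts how many elements the map function has actually processed.
    val seen = sc.longAccumulator("elements seen by map")

    // Step 1: only the lineage is recorded here; no job is submitted.
    val numbers = sc.parallelize(Seq(1, 2, 3, 4))
    val doubled = numbers.map { x => seen.add(1); x * 2 }

    // Nothing has been executed yet, so the accumulator is untouched.
    val before: Long = seen.value

    // Step 2: count() is an action, so the job runs now.
    val n = doubled.count()
    val after: Long = seen.value

    (before, n, after)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode with two threads, so no cluster is needed.
    val conf = new SparkConf()
      .setAppName("lazy-evaluation-demo")
      .setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (before, n, after) = run(sc)
      println(s"before action: $before, count: $n, after action: $after")
    } finally {
      sc.stop()
    }
  }
}
```

In local mode without task retries, the accumulator should still be 0 before count() and equal to the number of elements afterwards. Accumulator updates made inside transformations can be applied more than once if a task is re-executed, so treat this as a demonstration rather than a reliable counter.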
Martin Senne answered Mar 25 '23 01:03
