According to the Spark RDD docs:
All transformations in Spark are lazy, in that they do not compute their results right away... This design enables Spark to run more efficiently.
There are times when I need certain operations on my dataframes to happen right then and there. But because dataframe ops are "lazily evaluated" (per above), when I write these operations in the code, there's very little guarantee that Spark will actually execute them inline with the rest of the code. For example:
val someDataFrame : DataFrame = getSomehow()
val someOtherDataFrame : DataFrame = getSomehowAlso()
// Do some stuff with 'someDataFrame' and 'someOtherDataFrame'
// Now we need to do a union RIGHT HERE AND NOW, because
// the next few lines of code require the union to have
// already taken place!
val unionDataFrame : DataFrame = someDataFrame.unionAll(someOtherDataFrame)
// Now do some stuff with 'unionDataFrame'...
So my workaround for this (so far) has been to run .show() or .count() immediately following my time-sensitive dataframe op, like so:
val someDataFrame : DataFrame = getSomehow()
val someOtherDataFrame : DataFrame = getSomehowAlso()
// Do some stuff with 'someDataFrame' and 'someOtherDataFrame'
val unionDataFrame : DataFrame = someDataFrame.unionAll(someOtherDataFrame)
unionDataFrame.count() // Forces the union to execute/compute
// Now do some stuff with 'unionDataFrame'...
...which forces Spark to execute the dataframe op right then and there, inline.
This feels awfully hacky/kludgy to me. So I ask: is there a more generally-accepted and/or efficient way to force dataframe ops to happen on-demand (and not be lazily evaluated)?
Hence, lazy evaluation enhances the power of Apache Spark by reducing the execution time of RDD operations. Spark maintains the lineage graph to remember the operations performed on an RDD; as a result, it optimizes performance and achieves fault tolerance.
Although most things in Spark SQL are executed lazily, commands evaluate eagerly: Apache Spark executes them as soon as you define them in your pipeline, e.g. via the sql() method.
Spark's execution model relies on lazy evaluation: operations are broken up into transformations applied to data sets, and actions intended to derive and produce a result from that series of transformations. Yes, by default all transformations in Spark are lazy.
When we call an action on a Spark dataframe, all the pending transformations get executed one by one. This happens because of Spark's lazy evaluation, which does not execute the transformations until an action is called. The show() operator displays records of a dataframe in the output; by default it displays 20 records, and to see more data we need to pass a parameter. The head() operator returns the first row of the dataframe; if you need the first n records, use head(n).
An execution plan is the set of operations executed to translate a query language statement (SQL, Spark SQL, dataframe operations, etc.) into a set of optimized logical and physical operations. To sum up, it's the set of operations that will be executed, from the SQL (or Spark SQL) statement down to the DAG that will be sent to the Spark executors.
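The transformation-vs-action split described above can be sketched in plain Scala, with no Spark at all, using a lazy view. This is only an illustrative analogy (the names `LazyDemo` and `evaluations` are mine, not Spark's): `map` on a view is recorded but not run, and only a terminal operation like `toList` plays the role of an "action":

```scala
// Plain-Scala analogy for Spark's lazy model: view.map is a
// "transformation" (recorded, not run); toList is the "action".
object LazyDemo extends App {
  var evaluations = 0
  // Transformation: nothing is computed yet, the function is only recorded.
  val data = (1 to 5).view.map { x => evaluations += 1; x * 2 }
  assert(evaluations == 0)     // no work has been done so far
  // "Action": forces the whole recorded chain to run.
  val result = data.toList
  assert(evaluations == 5)
  assert(result == List(2, 4, 6, 8, 10))
  println(result)
}
```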
No.
You have to call an action to force Spark to do actual work. Transformations won't trigger that effect, and that's one of the reasons to love Spark.
By the way, I am pretty sure that Spark knows very well when something must be done "right here and now", so you are probably focusing on the wrong point.
Can you just confirm that count() and show() are considered "actions"?
You can see some of the action functions of Spark in the documentation, where count() is listed. show() is not, and I haven't used it before, but it feels like it is an action; how can you show the result without doing actual work? :)
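To make the "count() is an action" point concrete without a Spark cluster, here is a toy dataset class of my own invention (`ToyDataset` is not a Spark API) that records transformations as a composed function and only runs them when an action is called:

```scala
// Toy mimic of Spark's model: transformations compose a deferred
// computation; actions (count, collect) actually execute it.
final case class ToyDataset[A](compute: () => Seq[A]) {
  // Transformations: lazily compose, never execute.
  def map[B](f: A => B): ToyDataset[B] =
    ToyDataset(() => compute().map(f))
  def union(other: ToyDataset[A]): ToyDataset[A] =
    ToyDataset(() => compute() ++ other.compute())
  // Actions: trigger the whole recorded chain.
  def count(): Long = compute().size
  def collect(): Seq[A] = compute()
}

object ToyDemo extends App {
  val a = ToyDataset(() => Seq(1, 2))
  val b = ToyDataset(() => Seq(3))
  val u = a.union(b).map(_ * 10)   // nothing computed yet
  assert(u.count() == 3L)          // the action runs the union + map
  assert(u.collect() == Seq(10, 20, 30))
}
```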
Are you insinuating that Spark would automatically pick up on that, and do the union (just in time)?
Yes! :)
Spark remembers the transformations you have called, and when an action appears, it will do them, at just the right time!
Something to remember: because of this policy of doing actual work only when an action appears, you will not see a logical error in your transformation(s) until the action takes place!
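That last point can be reproduced in plain Scala (again, no Spark; this is just an analogy using a lazy view): building a chain that divides by zero succeeds, and the exception only surfaces when the chain is forced:

```scala
import scala.util.Try

object LateErrorDemo extends App {
  // Building the lazy chain succeeds even though it divides by zero...
  val broken = (1 to 3).view.map(x => x / 0)
  // ...the ArithmeticException only surfaces when evaluation is forced,
  // just as a bad transformation in Spark only fails at the action.
  assert(Try(broken.toList).isFailure)
  println("error appeared only at evaluation time")
}
```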