 

How to force Spark to evaluate DataFrame operations inline

According to the Spark RDD docs:

All transformations in Spark are lazy, in that they do not compute their results right away... This design enables Spark to run more efficiently.

There are times when I need certain operations on my dataframes to happen right then and there. But because dataframe ops are "lazily evaluated" (per above), when I write these operations in the code, there's very little guarantee that Spark will actually execute them inline with the rest of the code. For example:

val someDataFrame : DataFrame = getSomehow()
val someOtherDataFrame : DataFrame = getSomehowAlso()
// Do some stuff with 'someDataFrame' and 'someOtherDataFrame'

// Now we need to do a union RIGHT HERE AND NOW, because
// the next few lines of code require the union to have
// already taken place!
val unionDataFrame : DataFrame = someDataFrame.unionAll(someOtherDataFrame)

// Now do some stuff with 'unionDataFrame'...

So my workaround for this (so far) has been to run .show() or .count() immediately following my time-sensitive dataframe op, like so:

val someDataFrame : DataFrame = getSomehow()
val someOtherDataFrame : DataFrame = getSomehowAlso()
// Do some stuff with 'someDataFrame' and 'someOtherDataFrame'

val unionDataFrame : DataFrame = someDataFrame.unionAll(someOtherDataFrame)
unionDataFrame.count()  // Forces the union to execute/compute

// Now do some stuff with 'unionDataFrame'...

...which forces Spark to execute the dataframe op right then and there, inline.

This feels awfully hacky/kludgy to me. So I ask: is there a more generally-accepted and/or efficient way to force dataframe ops to happen on-demand (and not be lazily evaluated)?
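As a rough sketch of the common idiom (assuming a running SparkSession named `spark`, and that `getSomehow()`/`getSomehowAlso()` are the asker's own hypothetical helpers): if the goal is to materialize a result once so later code reuses it, the usual pattern is to cache and then trigger one cheap action. Note also that `union()` replaced the deprecated `unionAll()` as of Spark 2.0.

```scala
import org.apache.spark.sql.DataFrame

val someDataFrame: DataFrame = getSomehow()
val someOtherDataFrame: DataFrame = getSomehowAlso()

// union() is the Spark 2.x name for unionAll()
val unionDataFrame: DataFrame = someDataFrame.union(someOtherDataFrame)

unionDataFrame.cache()   // mark for materialization (itself lazy)
unionDataFrame.count()   // action: computes the union once and populates the cache

// Subsequent actions on unionDataFrame now read the cached data
// instead of recomputing the union.
```

The `cache()` + action pair is only worthwhile when the result is reused; a bare `count()` without caching forces the computation but throws the result away, so later actions recompute it.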

asked Sep 08 '16 by smeeb

1 Answer

No.

You have to call an action to force Spark to do actual work. Transformations won't trigger that effect, and that's one of the reasons to love Spark.


By the way, I am pretty sure that Spark knows very well when something must be done "right here and now", so you are probably focusing on the wrong point.


Can you just confirm that count() and show() are considered "actions"?

You can see some of the action functions of Spark in the documentation, where count() is listed. show() is not, and I haven't used it before, but it feels like it is an action: how can you show the result without doing actual work? :)
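For illustration (a sketch assuming a SparkSession named `spark`), both calls below trigger a Spark job, which is what makes them actions rather than transformations:

```scala
val df = spark.range(10).toDF("n")  // lazy: nothing computed yet

df.count()  // action: runs a job and returns the row count
df.show(5)  // action: runs a job, computes and prints the first 5 rows
```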

Are you insinuating that Spark would automatically pick up on that, and do the union (just in time)?

Yes! :)

Spark remembers the transformations you have called, and when an action appears, it will perform them, just at the right time!


Something to remember: because of this policy of doing actual work only when an action appears, you will not see a logical error in your transformation(s) until the action takes place!
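A hypothetical sketch of that last point (assuming a SparkSession named `spark`): a runtime error buried in a transformation, here a UDF that divides by zero, raises nothing when the transformation is defined; it only surfaces when an action finally runs the job.

```scala
import org.apache.spark.sql.functions.{col, udf}

val divide = udf((x: Long) => 100L / x)

val df = spark.range(5).toDF("x")      // contains x = 0
val risky = df.select(divide(col("x")))  // no error yet: transformation is lazy

risky.count()  // the job runs here, and the ArithmeticException surfaces now
```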

answered Oct 03 '22 by gsamaras