 

How to force Spark to evaluate DataFrame operations inline

According to the Spark RDD docs:

All transformations in Spark are lazy, in that they do not compute their results right away... This design enables Spark to run more efficiently.

There are times when I need certain operations on my dataframes to happen right then and there. But because dataframe ops are "lazily evaluated" (per above), when I write these operations in the code, there's very little guarantee that Spark will actually execute them inline with the rest of the code. For example:

val someDataFrame : DataFrame = getSomehow()
val someOtherDataFrame : DataFrame = getSomehowAlso()
// Do some stuff with 'someDataFrame' and 'someOtherDataFrame'

// Now we need to do a union RIGHT HERE AND NOW, because
// the next few lines of code require the union to have
// already taken place!
val unionDataFrame : DataFrame = someDataFrame.unionAll(someOtherDataFrame)

// Now do some stuff with 'unionDataFrame'...

So my workaround for this (so far) has been to run .show() or .count() immediately following my time-sensitive dataframe op, like so:

val someDataFrame : DataFrame = getSomehow()
val someOtherDataFrame : DataFrame = getSomehowAlso()
// Do some stuff with 'someDataFrame' and 'someOtherDataFrame'

val unionDataFrame : DataFrame = someDataFrame.unionAll(someOtherDataFrame)
unionDataFrame.count()  // Forces the union to execute/compute

// Now do some stuff with 'unionDataFrame'...

...which forces Spark to execute the dataframe op right then and there, inline.

This feels awfully hacky/kludgy to me. So I ask: is there a more generally-accepted and/or efficient way to force dataframe ops to happen on-demand (and not be lazily evaluated)?
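As a rough sketch of the common idiom (assuming a running SparkSession named `spark`, and that `getSomehow()`/`getSomehowAlso()` are the asker's own hypothetical helpers): if the goal is to materialize a result once so later code reuses it, the usual pattern is to cache and then trigger one cheap action. Note also that `union()` replaced the deprecated `unionAll()` as of Spark 2.0.

```scala
import org.apache.spark.sql.DataFrame

val someDataFrame: DataFrame = getSomehow()
val someOtherDataFrame: DataFrame = getSomehowAlso()

// union() is the Spark 2.x name for unionAll()
val unionDataFrame: DataFrame = someDataFrame.union(someOtherDataFrame)

unionDataFrame.cache()   // mark for materialization (itself lazy)
unionDataFrame.count()   // action: computes the union once and populates the cache

// Subsequent actions on unionDataFrame now read the cached data
// instead of recomputing the union.
```

The `cache()` + action pair is only worthwhile when the result is reused; a bare `count()` without caching forces the computation but throws the result away, so later actions recompute it.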

asked Sep 08 '16 by smeeb

1 Answer

No.

You have to call an action to force Spark to do actual work. Transformations won't trigger that effect, and that's one of the reasons to love Spark.


By the way, I am pretty sure that Spark knows very well when something must be done "right here and now", so you are probably focusing on the wrong point.


Can you just confirm that count() and show() are considered "actions"?

You can see some of the action functions of Spark in the documentation, where count() is listed. show() is not, and I haven't used it before, but it feels like it is an action: how can you show the result without doing actual work? :)
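For illustration (a sketch assuming a SparkSession named `spark`), both calls below trigger a Spark job, which is what makes them actions rather than transformations:

```scala
val df = spark.range(10).toDF("n")  // lazy: nothing computed yet

df.count()  // action: runs a job and returns the row count
df.show(5)  // action: runs a job, computes and prints the first 5 rows
```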

Are you insinuating that Spark would automatically pick up on that, and do the union (just in time)?

Yes! :)

Spark remembers the transformations you have called, and when an action appears, it will perform them, just at the right time!


Something to remember: because of this policy of doing actual work only when an action appears, you will not see a logical error in your transformation(s) until the action takes place!
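A hypothetical sketch of that last point (assuming a SparkSession named `spark`): a runtime error buried in a transformation, here a UDF that divides by zero, raises nothing when the transformation is defined; it only surfaces when an action finally runs the job.

```scala
import org.apache.spark.sql.functions.{col, udf}

val divide = udf((x: Long) => 100L / x)

val df = spark.range(5).toDF("x")      // contains x = 0
val risky = df.select(divide(col("x")))  // no error yet: transformation is lazy

risky.count()  // the job runs here, and the ArithmeticException surfaces now
```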

answered Oct 03 '22 by gsamaras