How can I force Spark to execute a call to map, even if it thinks it does not need to be executed due to its lazy evaluation?
I have tried putting cache() after the map call, but that still doesn't do the trick. My map method actually uploads results to HDFS, so it's not useless, but Spark thinks it is.
Forcing computation on an RDD is relatively simple: call count() and Spark will evaluate the RDD.
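For instance, a minimal sketch (sc is the usual SparkContext from the shell; the input path is hypothetical):

```scala
val mapped = sc.textFile("hdfs:///data/input.txt").map(_.toUpperCase)
mapped.count()  // count() is an action, so the map above is actually executed
```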
In Spark, lazy evaluation means that you can apply as many transformations as you want, but Spark will not start executing the process until an action is called. In short, transformations are lazy but actions are eager.
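A quick way to see this (a sketch assuming a local-mode shell, where println output from tasks reaches your console):

```scala
val rdd = sc.parallelize(1 to 5)
val doubled = rdd.map { x =>
  println(s"mapping $x")  // nothing prints yet: map is lazy
  x * 2
}
// ... no output so far ...
doubled.collect()         // action: only now does "mapping ..." appear
```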
Short answer:
To force Spark to execute a transformation, you'll need to require a result. Sometimes a simple count action is sufficient.
Longer answer:
OK, let's review the RDD operations.

RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset.
For example, map is a transformation that passes each dataset element through a function and returns a new RDD representing the results. On the other hand, reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program (although there is also a parallel reduceByKey that returns a distributed dataset).
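A concrete sketch of that map/reduce pair (again assuming the shell's sc):

```scala
val lengths = sc.parallelize(Seq("spark", "is", "lazy")).map(_.length)  // transformation: nothing runs yet
val total   = lengths.reduce(_ + _)  // action: executes the map and returns 11 to the driver
```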
All transformations in Spark are lazy, in that they do not compute their results right away.
Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently – for example, we can realize that a dataset created through map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset.
By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicated across multiple nodes.
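As a sketch of caching (the input path and the choice of StorageLevel.MEMORY_AND_DISK below are illustrative, not prescriptive):

```scala
import org.apache.spark.storage.StorageLevel

val mapped = sc.textFile("hdfs:///data/input.txt").map(_.toUpperCase)
mapped.cache()  // shorthand for persist(StorageLevel.MEMORY_ONLY)
// mapped.persist(StorageLevel.MEMORY_AND_DISK)  // alternative: spill to disk if it doesn't fit in memory
mapped.count()  // first action: computes the RDD and caches it
mapped.count()  // second action: served from the cache, no recomputation
```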
To force Spark to execute a call to map, you'll need to require a result. Sometimes a count action is sufficient.
Spark transformations only describe what has to be done; to trigger execution you need an action.

In your case there is a deeper problem. If the goal is to produce a side effect, such as storing data on HDFS, the right method to use is foreach. It is an action, and it has clean semantics. What is also important: unlike map, it doesn't imply referential transparency, so side effects are expected there.
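A hedged sketch of the foreach approach, where uploadToHdfs is a hypothetical stand-in for whatever per-record upload logic your map currently performs:

```scala
// Sketch: replace uploadToHdfs with your real HDFS upload logic.
def uploadToHdfs(record: String): Unit = {
  // ... write the record to HDFS ...
}

val records = sc.textFile("hdfs:///data/input.txt")  // hypothetical input path
records.foreach(uploadToHdfs)  // an action: runs eagerly, purely for its side effect
```

If each upload opens a connection, foreachPartition is worth a look: it hands you an iterator per partition, so the connection can be set up once per partition instead of once per record.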