Spark has broadcast variables, which are read-only, and accumulator variables, which can be updated by the nodes but not read. Is there a way - or a workaround - to define a variable which is both updatable and readable?
One requirement for such a read/write global variable would be to implement a cache. As files are loaded and processed as RDDs, calculations are performed. The results of these calculations - happening on several nodes running in parallel - need to be placed into a map, which has as its key some of the attributes of the entity being processed. As subsequent entities within the RDDs are processed, the cache is queried.
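Expressed as a sketch, assuming a hypothetical `Entity` type, an `entities` RDD and an `expensiveCalc` function (none of these names come from the actual application), the access pattern being asked about would look roughly like this, where `cache` is exactly the read/write structure that is missing:

```scala
// Hypothetical illustration of the desired pattern: `cache` stands for a
// cluster-wide, mutable map that Spark does not provide out of the box.
case class Entity(country: String, category: String, payload: String)

val results = entities.map { e =>            // entities: RDD[Entity], processed on many nodes
  val key = s"${e.country}-${e.category}"    // key derived from the entity's attributes
  cache.get(key) match {
    case Some(v) => v                        // hit: reuse a result computed for an earlier entity
    case None =>
      val v = expensiveCalc(e)               // miss: compute the result ...
      cache.put(key, v)                      // ... and make it visible to later entities
      v
  }
}
```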
Scala does have ScalaCache, which is a facade for cache implementations such as Google Guava. But how would such a cache be included and accessed within a Spark application?
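For a sense of what such a cache looks like on its own, here is a minimal sketch using Guava directly from Scala (ScalaCache is a facade over back-ends like this); the key format and the `Result` type are assumptions made for the example, not anything prescribed by either library:

```scala
import java.util.concurrent.TimeUnit
import com.google.common.cache.{Cache, CacheBuilder}

// Placeholder for whatever the calculations actually produce.
case class Result(value: Double)

// A bounded, expiring, in-process cache.
val guavaCache: Cache[String, Result] =
  CacheBuilder.newBuilder()
    .maximumSize(100000)
    .expireAfterWrite(30, TimeUnit.MINUTES)
    .build[String, Result]()

guavaCache.put("US-books", Result(0.42))
val hit  = Option(guavaCache.getIfPresent("US-books"))   // Some(Result(0.42))
val miss = Option(guavaCache.getIfPresent("DE-music"))   // None
```

The open question is where such an instance should live in a Spark job, since it is confined to a single JVM.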
The cache could be defined as a variable in the driver application which creates the SparkContext. But then there would be two issues:
What is the best way to implement and store such a cache?
Thanks
Supporting general, read-write shared variables across tasks would be inefficient, so Apache Spark provides only two types of shared variable: broadcast variables and accumulators. A broadcast variable caches a read-only value on each machine rather than shipping a copy of it with every task.
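As a quick reference, a minimal sketch of both built-in shared-variable types, with a made-up lookup map and accumulator name:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("shared-variables").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// Broadcast variable: a read-only value shipped once to each executor.
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))

// Accumulator: tasks can only add to it; only the driver can read the total.
val misses = sc.longAccumulator("lookup-misses")

val resolved = sc.parallelize(Seq("a", "b", "c")).map { k =>
  lookup.value.getOrElse(k, { misses.add(1); 0 })   // default is evaluated only on a miss
}.collect()

println(s"resolved = ${resolved.toSeq}, misses = ${misses.value}")
```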
Well, the best way of doing this is not doing it at all. In general, the Spark processing model doesn't provide any guarantees* regarding where, when, in what order, or how many times a given piece of code is executed. Moreover, any updates which depend directly on the Spark architecture are not granular.
These are the properties which make Spark scalable and resilient, but at the same time they make shared mutable state very hard to implement and, most of the time, completely useless.
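To make this concrete, a small sketch, assuming a plain SparkContext `sc`, of the naive approach of mutating a driver-side map from inside a task:

```scala
// A mutable map created on the driver ...
val cache = scala.collection.mutable.Map[String, Int]()

// ... and updated inside a task. Each task mutates its own deserialized copy
// of the closure, so nothing flows back to the driver and nothing is shared
// between executors. (In local mode this can appear to work, which makes it
// all the more deceptive.)
sc.parallelize(1 to 10).foreach { i =>
  cache.put(s"key-$i", i)
}

println(cache.size)   // 0 when running on a cluster
```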
If all you want is a simple cache then you have multiple options; one of them - a local cache per executor JVM, combined with application-specific partitioning so that related entities land on the same executor - is sketched below.
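A minimal sketch of that option, keeping the hypothetical entity fields and `expensiveCalc` from above:

```scala
import java.util.concurrent.ConcurrentHashMap

// One instance per executor JVM, initialised lazily by the first task on that
// executor that touches it. Entries are NOT visible to other executors or to
// the driver, so this pays off when repeated keys tend to land on the same
// executor (e.g. after partitioning the RDD by the cache key).
object ExecutorLocalCache {
  private val cache = new ConcurrentHashMap[String, Double]()
  def getOrCompute(key: String)(compute: => Double): Double =
    cache.computeIfAbsent(key, _ => compute)
}

val enriched = entities.map { e =>                       // entities: RDD[Entity]
  val key = s"${e.country}-${e.category}"                // key built from entity attributes
  (e, ExecutorLocalCache.getOrCompute(key)(expensiveCalc(e)))
}
```

Because the object lives in the executor process rather than in the task closure, every task on that executor sees the same instance, while different executors each build their own independent copy.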
If the application requires much more complex communication, you may try different message-passing tools to keep state synchronized, but in general this requires complex and potentially fragile code.
* This partially changed in Spark 2.4 with the introduction of barrier execution mode (SPARK-24795, SPARK-24822).