 

Equivalent of Distributed Cache in Spark? [duplicate]

In Hadoop, you can use the distributed cache to copy read-only files to each node. What is the equivalent way of doing this in Spark? I know about broadcast variables, but those are only good for variables, not files.

asked Jun 25 '15 by MetallicPriest

People also ask

What are the caching methods in Spark?

Caching methods in Spark:

- DISK_ONLY: persist data on disk only, in serialized format.
- MEMORY_ONLY: persist data in memory only, in deserialized format.
- MEMORY_AND_DISK: persist data in memory; if enough memory is not available, evicted blocks are stored on disk.
- OFF_HEAP: data is persisted in off-heap memory.
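
A minimal sketch of how a storage level is chosen in code, assuming a local Spark session; the app name, master, and input path below are placeholders:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Minimal sketch; the app name, master, and input path are placeholders.
val spark = SparkSession.builder().appName("persist-demo").master("local[*]").getOrCreate()
val events = spark.sparkContext.textFile("hdfs:///data/events.txt")

// Keep the RDD in memory, spilling evicted blocks to disk when memory runs out.
events.persist(StorageLevel.MEMORY_AND_DISK)

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY).
events.count()  // the first action materializes and caches the data
```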

Is the DistributedCache file also stored in HDFS? (True or false)

Files can be on HDFS, the local filesystem, or any Hadoop-readable filesystem such as S3. If the user does not specify a scheme, Hadoop assumes the file is on the local filesystem, even when the default filesystem is not the local filesystem. One can also copy archive files using the -archives option.

What is a DistributedCache in Apache Hadoop?

Hadoop DistributedCache is a mechanism provided by the Hadoop MapReduce framework that copies read-only files, archives, or JAR files to the worker nodes before any tasks for the job execute on that node. Files are normally copied once per job to save network bandwidth.
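
A minimal sketch of registering a cached file with Hadoop's newer MapReduce API (shown here from Scala); the job name and path are hypothetical:

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.Job

// Minimal sketch; the configuration, job name, and path are placeholders.
val conf = new Configuration()
val job = Job.getInstance(conf, "lookup-join")

// Ship a read-only lookup file to every worker before tasks run.
// The "#lookup" fragment exposes it under that name in each task's working directory.
job.addCacheFile(new URI("hdfs:///data/lookup.txt#lookup"))
```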

What is the DistributedCache used for?

A distributed cache is a system that pools together the random-access memory (RAM) of multiple networked computers into a single in-memory data store used as a data cache to provide fast access to data.


1 Answer

Take a look at SparkContext.addFile():

Add a file to be downloaded with this Spark job on every node. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI. To access the file in Spark jobs, use SparkFiles.get(fileName) to find its download location.

A directory can be given if the recursive option is set to true. Currently directories are only supported for Hadoop-supported filesystems.
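
A minimal sketch of the pattern, assuming a local session; the file path and names below are placeholders, not from the original question:

```scala
import org.apache.spark.SparkFiles
import org.apache.spark.sql.SparkSession
import scala.io.Source

// Minimal sketch; the app name, master, and HDFS path are placeholders.
val spark = SparkSession.builder().appName("addFile-demo").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// Ship the read-only file to every node, like Hadoop's distributed cache.
sc.addFile("hdfs:///config/lookup.txt")

// Inside tasks, resolve the local copy with SparkFiles.get.
val matched = sc.parallelize(1 to 10).mapPartitions { iter =>
  val localPath = SparkFiles.get("lookup.txt")
  val lookup = Source.fromFile(localPath).getLines().toSet
  iter.filter(x => lookup.contains(x.toString))
}
matched.collect()
```

Unlike a broadcast variable, the file is downloaded to each executor's local disk rather than kept in memory, which fits larger read-only files.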

answered Oct 01 '22 by Piotr Rudnicki