In Hadoop, you can use the distributed cache to copy read-only files to each node. What is the equivalent way of doing so in Spark? I know about broadcast variables, but those are only good for variables, not files.
Caching methods in Spark:

- DISK_ONLY: persist data on disk only, in serialized format.
- MEMORY_ONLY: persist data in memory only, in deserialized format.
- MEMORY_AND_DISK: persist data in memory; if enough memory is not available, evicted blocks are stored on disk.
- OFF_HEAP: persist data in off-heap memory.
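As a minimal sketch, choosing one of these levels with RDD.persist might look like this (the data and app name are placeholders):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("PersistLevels").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val rdd = sc.parallelize(1 to 1000000)

// Keep blocks in memory, spilling evicted blocks to disk when memory runs low:
rdd.persist(StorageLevel.MEMORY_AND_DISK)

rdd.count() // first action computes and caches the data
rdd.count() // later actions read from the cache

rdd.unpersist() // release the cached blocks when no longer needed
```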
Files can be on HDFS, the local filesystem, or any Hadoop-readable filesystem such as S3. If the user does not specify a scheme, Hadoop assumes the file is on the local filesystem, even when the default filesystem is not the local one. Archive files can also be copied using the --archives option.
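For illustration, a hypothetical spark-submit invocation shipping a file and an archive could look like the comment below; all paths, names, and the '#' aliases are placeholders:

```scala
// spark-submit \
//   --files hdfs:///config/lookup.txt \
//   --archives s3a://my-bucket/ref-data.zip#refdata \
//   --class com.example.MyApp myapp.jar

// Inside tasks, a file shipped with --files can be resolved via SparkFiles:
import org.apache.spark.SparkFiles
import scala.io.Source

val lines = Source.fromFile(SparkFiles.get("lookup.txt")).getLines().toList
// On YARN, the archive is unpacked in the container's working directory
// under the alias given after '#', i.e. ./refdata/ in this example.
```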
Hadoop DistributedCache is a mechanism provided by the Hadoop MapReduce framework that copies read-only files, archives, or jar files to the worker nodes before any task of the job executes on those nodes. Files are normally copied only once per job, to save network bandwidth.
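As a sketch of the driver-side Hadoop API (using the modern Job methods that replaced the deprecated DistributedCache class; the path and symlink name are placeholders):

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.Job

val job = Job.getInstance(new Configuration(), "cache-example")

// Ship a read-only file to every worker before tasks start; the optional
// '#stopwords' fragment names a symlink in the task's working directory.
job.addCacheFile(new URI("hdfs:///shared/stopwords.txt#stopwords"))

// Inside a Mapper or Reducer, context.getCacheFiles() lists the cached URIs.
```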
More generally, a distributed cache is a system that pools the random-access memory (RAM) of multiple networked computers into a single in-memory data store, providing fast access to shared data.
Take a look at SparkContext.addFile()
Add a file to be downloaded with this Spark job on every node. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI. To access the file in Spark jobs, use SparkFiles.get(fileName) to find its download location.
A directory can be given if the recursive option is set to true. Currently directories are only supported for Hadoop-supported filesystems.
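A minimal sketch of addFile plus SparkFiles.get (the path and lookup contents are placeholders):

```scala
import org.apache.spark.SparkFiles
import org.apache.spark.sql.SparkSession
import scala.io.Source

val spark = SparkSession.builder().appName("AddFileExample").getOrCreate()
val sc = spark.sparkContext

// Ship a read-only lookup file to every node:
sc.addFile("hdfs:///shared/lookup.txt")

val result = sc.parallelize(Seq("a", "b", "c")).mapPartitions { iter =>
  // SparkFiles.get resolves the file's local download location on this executor.
  val lookup = Source.fromFile(SparkFiles.get("lookup.txt")).getLines().toSet
  iter.filter(lookup.contains)
}.collect()
```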