I am looking for functionality similar to Hadoop's distributed cache in Spark. I need a relatively small data file (with some index values) to be present on all nodes in order to make some calculations. Is there any approach that makes this possible in Spark?
My workaround so far consists of distributing and reducing the index file as a normal job, which takes around 10 seconds in my application. After that, I collect the file on the driver and register it as a broadcast variable, as follows:
JavaRDD<String> indexFile = ctx.textFile("s3n://mybucket/input/indexFile.txt", 1);
final List<String> localIndex = indexFile.collect();                    // bring the index to the driver
final Broadcast<List<String>> globalIndex = ctx.broadcast(localIndex); // ship it to every node
This makes globalIndex available to every task in the program. So far this patch works for me, but I consider it is not the best solution. Would it still be effective with a considerably larger data set or a large number of variables?
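For illustration, here is roughly how my tasks read the broadcast value afterwards (dataLines stands in for my actual input RDD, and the membership check is just a placeholder for my real lookup logic):

import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;

JavaRDD<Boolean> flags = dataLines.map(new Function<String, Boolean>() {
    @Override
    public Boolean call(String line) {
        // each task reads the node-local broadcast copy instead of re-fetching the file
        return globalIndex.value().contains(line);
    }
});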
Note: I am using Spark 1.0.0 on a standalone cluster running on several EC2 instances.
Distributed Cache in Hadoop is a facility provided by the MapReduce framework to cache files needed by applications. It can cache read-only text files, archives, JAR files, etc. Once a file has been cached for a job, Apache Hadoop makes it available on each DataNode where map/reduce tasks are running.
For background: Hadoop is an open-source framework for distributed storage and processing of very large data sets, which also makes it possible to run applications on clusters with thousands of nodes.
Each DataNode gets a local copy of the file sent through the Distributed Cache. When the job finishes, these files are deleted from the DataNodes.
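For comparison, a minimal sketch of how this looks on the Hadoop side, assuming a configured Hadoop 2.x Job object (the HDFS path, the "index" symlink name, and the IndexMapper class are examples, not from your setup):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

// Driver side: register the file; the "#index" fragment names a local symlink.
job.addCacheFile(new URI("hdfs:///input/indexFile.txt#index"));

// Mapper side: the cached file appears as a local file named "index".
public class IndexMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void setup(Context context) throws IOException {
        BufferedReader reader = new BufferedReader(new FileReader("index"));
        // ... load the index into memory for use in map() ...
        reader.close();
    }
}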
Please have a look at the SparkContext.addFile() method. I guess that is what you were looking for.
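Roughly, using the same S3 path from your question (dataLines again stands in for your actual data RDD, and the lookup inside the task is a placeholder):

import org.apache.spark.SparkFiles;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;

// Driver: ship the small file to every node once.
ctx.addFile("s3n://mybucket/input/indexFile.txt");

// Inside tasks: resolve the node-local copy by its file name.
JavaRDD<String> result = dataLines.map(new Function<String, String>() {
    @Override
    public String call(String line) throws Exception {
        String localPath = SparkFiles.get("indexFile.txt");
        // ... open localPath and look up the index values for this line ...
        return line;
    }
});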