
Hadoop DistributedCache functionality in Spark

I am looking for functionality in Spark similar to Hadoop's distributed cache. I need a relatively small data file (with some index values) to be present on all nodes in order to make some calculations. Is there any approach that makes this possible in Spark?

My workaround so far consists of distributing and reducing the index file as a normal processing step, which takes around 10 seconds in my application. After that, I collect the file's contents on the driver and register them as a broadcast variable, as follows:

// Load the index file and pull its contents down to the driver...
JavaRDD<String> indexFile = ctx.textFile("s3n://mybucket/input/indexFile.txt", 1);
ArrayList<String> localIndex = new ArrayList<String>(indexFile.collect());

// ...then broadcast it so that every node gets a read-only copy.
final Broadcast<ArrayList<String>> globalIndex = ctx.broadcast(localIndex);

This makes the contents of globalIndex available everywhere in the program. So far it is a patch that works for me, but I do not consider it the best solution. Would it still be effective with a considerably bigger data set or a large number of variables?
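For reference, here is a minimal sketch of how the broadcast index is then consumed inside a transformation (the data file and the lookup logic are illustrative, not my real job):

import org.apache.spark.api.java.function.Function;

// Continuing the snippet above: each task reads the broadcast index locally,
// so the index is shipped to every node only once instead of with every task.
JavaRDD<String> filtered = ctx.textFile("s3n://mybucket/input/data.txt")
        .filter(new Function<String, Boolean>() {
            @Override
            public Boolean call(String line) {
                // Illustrative lookup: keep lines whose first field appears in the index.
                return globalIndex.value().contains(line.split(",")[0]);
            }
        });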

Note: I am using Spark 1.0.0 running on a standalone cluster spread across several EC2 instances.

asked Sep 02 '14 by Mikel Urkia

People also ask

What is distributed cache in Hadoop?

Distributed Cache in Hadoop is a facility provided by the MapReduce framework. The Distributed Cache can cache files needed by applications. It can cache read-only text files, archives, jar files, etc. Once we have cached a file for our job, Apache Hadoop will make it available on each DataNode where map/reduce tasks are running.
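As an illustration, here is a minimal sketch of registering such a file when configuring a MapReduce job (the HDFS path, job name, and "#index" link name are made up for the example):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "index-join");
// Ship a small read-only file to every node that runs tasks for this job.
// The "#index" fragment exposes it in the task working directory under the name "index".
job.addCacheFile(new URI("hdfs:///user/me/input/indexFile.txt#index"));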

What is Hadoop and Apache Spark?

First of all, we discuss what Hadoop is and what Apache Spark is. Hadoop is an open-source software framework for distributed storage and processing of huge data sets, which also makes it possible to run applications on systems with thousands of nodes.

Does big data clusters support Spark and Hadoop?

Big Data Clusters supports deployment-time and post-deployment configuration of Apache Spark and Hadoop components at the service and resource scopes. Big Data Clusters uses the same default configuration values as the respective open-source project for most settings.

How does distributed cache work with DataNodes?

Each DataNode gets a local copy of the file, which is sent through the Distributed Cache. When the job is finished, these files are deleted from the DataNodes.
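On the task side, a sketch of how a mapper might read that local copy once in setup() (the class name is illustrative and it assumes the file was added with a "#index" fragment, as in the earlier sketch):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class IndexAwareMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Set<String> index = new HashSet<String>();

    @Override
    protected void setup(Context context) throws IOException {
        // The cached file was symlinked into the task's working directory as "index".
        BufferedReader reader = new BufferedReader(new FileReader("index"));
        String line;
        while ((line = reader.readLine()) != null) {
            index.add(line.trim());
        }
        reader.close();
    }
}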


1 Answer

Please have a look at the SparkContext.addFile() method. I guess that is what you are looking for.
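A rough sketch of how it could be used from the Java API (the file names and the per-record logic are just placeholders): addFile() ships the file from the driver, and SparkFiles.get() resolves the local copy on whichever node the task runs.

import org.apache.spark.SparkFiles;
import org.apache.spark.api.java.function.Function;

// Driver side: ship the index file to every node in the cluster.
ctx.addFile("s3n://mybucket/input/indexFile.txt");

// Executor side: resolve the local copy that addFile() placed on this node.
JavaRDD<String> tagged = ctx.textFile("s3n://mybucket/input/data.txt")
        .map(new Function<String, String>() {
            @Override
            public String call(String line) throws Exception {
                String localIndexPath = SparkFiles.get("indexFile.txt");
                // ... open and parse localIndexPath as needed ...
                return line;
            }
        });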

answered Sep 18 '22 by Sai Krishna