
Hadoop DistributedCache functionality in Spark

I am looking for functionality in Spark similar to Hadoop's distributed cache. I need a relatively small data file (with some index values) to be present on all nodes in order to make some calculations. Is there any approach that makes this possible in Spark?

My workaround so far consists of distributing and reducing the index file as a normal processing step, which takes around 10 seconds in my application. After that, I collect the file's contents on the driver and register them as a broadcast variable, as follows:

// Load the index file and pull its contents down to the driver...
JavaRDD<String> indexFile = ctx.textFile("s3n://mybucket/input/indexFile.txt", 1);
ArrayList<String> localIndex = new ArrayList<String>(indexFile.collect());

// ...then broadcast it so that every node gets a read-only copy.
final Broadcast<ArrayList<String>> globalIndex = ctx.broadcast(localIndex);

This makes the contents of globalIndex available everywhere in the program. So far it is a patch that works for me, but I do not consider it the best solution. Would it still be effective with a considerably bigger data set or a large number of variables?
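For reference, here is a minimal sketch of how the broadcast index is then consumed inside a transformation (the data file and the lookup logic are illustrative, not my real job):

import org.apache.spark.api.java.function.Function;

// Continuing the snippet above: each task reads the broadcast index locally,
// so the index is shipped to every node only once instead of with every task.
JavaRDD<String> filtered = ctx.textFile("s3n://mybucket/input/data.txt")
        .filter(new Function<String, Boolean>() {
            @Override
            public Boolean call(String line) {
                // Illustrative lookup: keep lines whose first field appears in the index.
                return globalIndex.value().contains(line.split(",")[0]);
            }
        });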

Note: I am using Spark 1.0.0 running on a standalone cluster spread across several EC2 instances.

asked Sep 02 '14 by Mikel Urkia

People also ask

What is distributed cache in Hadoop?

Distributed Cache in Hadoop is a facility provided by the MapReduce framework. The Distributed Cache can cache files needed by applications. It can cache read-only text files, archives, jar files, etc. Once we have cached a file for our job, Apache Hadoop will make it available on each DataNode where map/reduce tasks are running.
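As an illustration, here is a minimal sketch of registering such a file when configuring a MapReduce job (the HDFS path, job name, and "#index" link name are made up for the example):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "index-join");
// Ship a small read-only file to every node that runs tasks for this job.
// The "#index" fragment exposes it in the task working directory under the name "index".
job.addCacheFile(new URI("hdfs:///user/me/input/indexFile.txt#index"));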

What is Hadoop and Apache Spark?

First of all, we discuss what Hadoop is and what Apache Spark is. Hadoop is an open-source software framework for distributed storage and processing of huge data sets, which also makes it possible to run applications on systems with thousands of nodes.

Does big data clusters support Spark and Hadoop?

Big Data Clusters supports deployment-time and post-deployment configuration of Apache Spark and Hadoop components at the service and resource scopes. Big Data Clusters uses the same default configuration values as the respective open-source project for most settings.

How does distributed cache work with DataNodes?

Each DataNode gets a local copy of the file, which is sent through the Distributed Cache. When the job is finished, these files are deleted from the DataNodes.
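On the task side, a sketch of how a mapper might read that local copy once in setup() (the class name is illustrative and it assumes the file was added with a "#index" fragment, as in the earlier sketch):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class IndexAwareMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Set<String> index = new HashSet<String>();

    @Override
    protected void setup(Context context) throws IOException {
        // The cached file was symlinked into the task's working directory as "index".
        BufferedReader reader = new BufferedReader(new FileReader("index"));
        String line;
        while ((line = reader.readLine()) != null) {
            index.add(line.trim());
        }
        reader.close();
    }
}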


1 Answer

Please have a look at the SparkContext.addFile() method. I guess that is what you are looking for.
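A rough sketch of how it could be used from the Java API (the file names and the per-record logic are just placeholders): addFile() ships the file from the driver, and SparkFiles.get() resolves the local copy on whichever node the task runs.

import org.apache.spark.SparkFiles;
import org.apache.spark.api.java.function.Function;

// Driver side: ship the index file to every node in the cluster.
ctx.addFile("s3n://mybucket/input/indexFile.txt");

// Executor side: resolve the local copy that addFile() placed on this node.
JavaRDD<String> tagged = ctx.textFile("s3n://mybucket/input/data.txt")
        .map(new Function<String, String>() {
            @Override
            public String call(String line) throws Exception {
                String localIndexPath = SparkFiles.get("indexFile.txt");
                // ... open and parse localIndexPath as needed ...
                return line;
            }
        });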

answered Sep 18 '22 by Sai Krishna