
Accessing files in hadoop distributed cache

Tags:

hadoop

I want to use the distributed cache to allow my mappers to access data. In main, I'm using the command

DistributedCache.addCacheFile(new URI("/user/peter/cacheFile/testCache1"), conf);

where /user/peter/cacheFile/testCache1 is a file that exists in HDFS.

Then, my setup function looks like this:

public void setup(Context context) throws IOException, InterruptedException{
    Configuration conf = context.getConfiguration();
    Path[] localFiles = DistributedCache.getLocalCacheFiles(conf);
    //etc
}

However, this localFiles array is always null.

I was initially running on a single-host cluster for testing, but I read that this will prevent the distributed cache from working. I tried with a pseudo-distributed cluster, but that didn't work either.

I'm using Hadoop 1.0.3.

Thanks, Peter

Asked Dec 06 '12 by Peter Cogan

People also ask

What is Hadoop distributed cache?

Distributed cache in Hadoop is a way to copy small files or archives to worker nodes in time for a job to run. Hadoop does this so that the worker nodes can use them when executing a task. To save network bandwidth, the files are copied once per job.

Is distributed cache is read only?

DistributedCache can be used to distribute simple, read-only data/text files as well as more complex types such as archives and jars. Archives (zip, tar and tgz/tar.gz files) are unarchived on the slave nodes (see the example below).

How data is stored in distributed cache?

A distributed cache is a system that pools together the random-access memory (RAM) of multiple networked computers into a single in-memory data store used as a data cache to provide fast access to data.

How do you use distributed cache in MapReduce?

Hadoop's MapReduce framework provides the facility to cache small to moderately sized read-only files such as text files, zip files, jar files, etc. and broadcast them to all the datanodes (worker nodes) where the MapReduce job is running. Each datanode gets a local copy of the file, which is sent through the distributed cache.
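
As the answers above describe, both plain files and archives can be registered from the driver, and archives are unpacked on each worker node. The following is only a minimal driver-side sketch for Hadoop 1.x (the HDFS paths are made up for illustration; DistributedCache lives in org.apache.hadoop.filecache in this version):

Configuration conf = new Configuration();
// A plain read-only file: each worker gets a local copy
DistributedCache.addCacheFile(new URI("/user/peter/cacheFile/lookup.txt"), conf);
// An archive (zip/tar/tgz): unpacked automatically on each worker
DistributedCache.addCacheArchive(new URI("/user/peter/cacheArchive/libs.zip"), conf);
// Register everything before constructing the Job (see the answer below)
Job job = new Job(conf, "wordcount");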


1 Answer

The problem here was that I was doing the following:

Configuration conf = new Configuration();
Job job = new Job(conf, "wordcount");
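// Too late: the Job constructor above has already copied conf, so the next call is lost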
DistributedCache.addCacheFile(new URI("/user/peter/cacheFile/testCache1"), conf);

Since the Job constructor makes an internal copy of the conf instance, adding the cache file afterwards has no effect on the configuration the job actually uses. Instead, I should do this:

Configuration conf = new Configuration();
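// Register the cache file first; the Job constructor below takes its own copy of conf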
DistributedCache.addCacheFile(new URI("/user/peter/cacheFile/testCache1"), conf);
Job job = new Job(conf, "wordcount");

And now it works. Thanks to Harsh on the Hadoop user list for the help.
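
Note that Job#getConfiguration() returns the job's own copy of the configuration, so calling DistributedCache.addCacheFile(uri, job.getConfiguration()) after constructing the Job should also work, although that variant isn't shown here.

For reference, a minimal sketch of a setup() method that reads the cached file line by line once it has been localized; the variable names and the loop body are placeholders (BufferedReader and FileReader come from java.io):

public void setup(Context context) throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    Path[] localFiles = DistributedCache.getLocalCacheFiles(conf);
    if (localFiles != null && localFiles.length > 0) {
        // localFiles[0] is a path on the task's local filesystem, not HDFS
        BufferedReader reader = new BufferedReader(new FileReader(localFiles[0].toString()));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                // e.g. load the line into an in-memory lookup used by map()
            }
        } finally {
            reader.close();
        }
    }
}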

Answered by Peter Cogan