
Confusion about distributed cache in Hadoop

What does the distributed cache actually mean? Does having a file in the distributed cache mean that it is available on every datanode, so that there is no internode communication for that data, or does it mean that the file is in memory on every node? If not, by what means can I get a file into memory for the entire job? Can this be done both for map-reduce and for a UDF?

(In particular, there is some comparatively small configuration data that I would like to keep in memory while a UDF runs as part of a Hive query.)

Thanks and regards, Dhruv Kapur.

asked May 20 '14 by Dhruv Kapur

People also ask

What is distributed cache in Hadoop?

Distributed cache in Hadoop is a facility for copying small files or archives to worker nodes before a task runs, so that those nodes can use them while executing the task. To save network bandwidth, the files are copied once per job rather than once per task.

Is distributed cache file also stored in HDFS?

To use the DistributedCache, an application specifies the files to cache via URLs (of the form hdfs://...) in the Job. The Hadoop DistributedCache assumes that the files specified through those URLs are already present on the FileSystem at the given path, and that every node in the cluster has permission to access them.
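With the Hadoop 2.x Job API, registering a cache file looks roughly like the sketch below. This is illustrative only: the namenode address and the /apps/config/lookup.txt path are placeholders, not from the original post.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CacheSetup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "distributed-cache-example");
        // Register a file that already sits in HDFS; the framework copies it to
        // each worker node before tasks start. The '#lookup.txt' fragment makes
        // the framework symlink the file under that name in the task's working
        // directory. Path and namenode address are placeholders.
        job.addCacheFile(new URI("hdfs://namenode:8020/apps/config/lookup.txt#lookup.txt"));
        // ... set mapper/reducer classes and input/output paths,
        // then submit with job.waitForCompletion(true)
    }
}
```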

What is the default size of distributed cache in Hadoop?

By default, the distributed cache size is 10 GB per node.
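That limit is configurable. On a YARN cluster, the NodeManager property yarn.nodemanager.localizer.cache.target-size-mb (default 10240 MB) governs the local cache size; it is set cluster-wide in yarn-site.xml, not per job. The entry below is an illustrative sketch, with 20480 as an example value only.

```xml
<!-- Example yarn-site.xml entry; 20480 MB (20 GB) is an illustrative value -->
<property>
  <name>yarn.nodemanager.localizer.cache.target-size-mb</name>
  <value>20480</value>
</property>
```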

What are the advantages of using a distributed cache?

With a distributed cache, you can have a large number of concurrent web sessions that can be accessed by any of the web application servers that are running the system. This lets you load balance web traffic over several application servers and not lose session data should any application server fail.


1 Answer

DistributedCache is a facility provided by the Map-Reduce framework to cache files needed by applications. Once you cache a file for your job, the Hadoop framework makes it available on every datanode (on the local file system, not in memory) where your map/reduce tasks run. You can then access the cached file as a local file from your Mapper or Reducer, read it, and populate a collection (e.g. an array or HashMap) in your code.
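As a rough sketch of the reading side, assuming the file was registered with job.addCacheFile(...) and a '#lookup.txt' symlink fragment as shown earlier, and assuming a tab-separated key/value file format (both assumptions, not from the original answer):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> lookup = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // The cache file was registered with a '#lookup.txt' fragment, so it is
        // symlinked into the task's working directory under that name.
        try (BufferedReader reader = new BufferedReader(new FileReader("lookup.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t", 2);
                if (parts.length == 2) {
                    lookup.put(parts[0], parts[1]);
                }
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Use the in-memory map however the job needs; this is just a sketch.
        String mapped = lookup.getOrDefault(value.toString(), "UNKNOWN");
        context.write(value, new Text(mapped));
    }
}
```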

Refer to https://hadoop.apache.org/docs/r2.6.1/api/org/apache/hadoop/filecache/DistributedCache.html

Let me know if you still have questions.

You can read the cache file as a local file in your UDF code as well. After reading the file using the Java APIs, simply populate a collection in memory.
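Below is a minimal sketch of such a UDF, assuming the old-style org.apache.hadoop.hive.ql.exec.UDF API, a file shipped to the nodes with ADD FILE, and a tab-separated key/value format (all assumptions for illustration, not from the original answer):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class LookupUDF extends UDF {

    private Map<String, String> lookup;

    public Text evaluate(Text key) throws IOException {
        if (key == null) {
            return null;
        }
        // Lazily load the lookup table once per task JVM.
        if (lookup == null) {
            lookup = new HashMap<>();
            // "ADD FILE /path/lookup.txt" localizes the file into the task's
            // working directory, so a bare relative name resolves to it.
            try (BufferedReader reader = new BufferedReader(new FileReader("lookup.txt"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] parts = line.split("\t", 2);
                    if (parts.length == 2) {
                        lookup.put(parts[0], parts[1]);
                    }
                }
            }
        }
        String value = lookup.get(key.toString());
        return value == null ? null : new Text(value);
    }
}
```

In the Hive session you would then run something like ADD FILE /path/to/lookup.txt; and CREATE TEMPORARY FUNCTION my_lookup AS 'LookupUDF'; before calling my_lookup(col) in queries (names here are placeholders).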

Refer to http://www.lichun.cc/blog/2013/06/use-a-lookup-hashmap-in-hive-script/

-Ashish

answered Sep 24 '22 by Ashish