Life of distributed cache in Hadoop

When files are transferred to nodes using the distributed cache mechanism in a Hadoop streaming job, does the system delete these files after the job completes? If they are deleted, which I presume they are, is there a way to make the cache persist across multiple jobs? Does this work the same way on Amazon's Elastic MapReduce?
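For context, files usually reach a streaming job's distributed cache via command-line options. A minimal sketch using the generic -files option (the jar path and file names here are placeholders; older releases used -cacheFile instead):

    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
        -files hdfs:///user/me/lookup.dat#lookup.dat \
        -input /user/me/input \
        -output /user/me/output \
        -mapper mapper.py \
        -reducer reducer.py
    # lookup.dat is symlinked into each task's working directory;
    # the question is what happens to the cached copy after the job finishes.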

asked Dec 19 '10 by JD Long

2 Answers

I was digging around in the source code, and it looks like files are deleted by TrackerDistributedCacheManager about once a minute, once their reference count drops to zero. The TaskRunner explicitly releases all of its files at the end of a task. Perhaps you could patch TaskRunner not to do this, and manage the cache yourself through more explicit means?

answered Oct 11 '22 by Bkkbrad


I cross-posted this question to the AWS forum and got a good recommendation: use hadoop fs -get to transfer files in a way that persists across jobs.
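A minimal sketch of that workaround, with placeholder paths: pull the file out of HDFS onto the node's local disk once, and let later jobs reuse the local copy instead of re-fetching it through the cache.

    # Skip the copy if an earlier job already fetched the file.
    if [ ! -f /mnt/local-cache/lookup.dat ]; then
        mkdir -p /mnt/local-cache
        hadoop fs -get /user/me/lookup.dat /mnt/local-cache/lookup.dat
    fi

On Elastic MapReduce, the same copy could be run from a bootstrap action so every node has the file in place before any job starts.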

answered Oct 11 '22 by JD Long