 

Caching in RAM using HDFS

I need to process some big files (~2 TB) with a small cluster (~10 servers) in order to produce a relatively small report (a few GB).

I only care about the final report, not the intermediate results, and the machines have a large amount of RAM, so it would be fantastic to use it to reduce disk access as much as possible (and consequently increase speed), ideally by storing the data blocks in volatile memory and using the disk only when necessary.

Looking at the configuration files and a previous question, it seems Hadoop doesn't offer this function. The Spark website talks about a memory_and_disk option, but I'd prefer not to ask the company to deploy new software based on a new language.

The only "solution" I found is to set dfs.datanode.data.dir as /dev/shm/ in hdfs-default.xml, to trick it to use volatile memory instead of the filesystem to store data, still in this case it would behave badly, I assume, when the RAM gets full and it uses the swap.

Is there a trick to make Hadoop store datablocks as much as possible on RAM and write on disk only when necessary?

Asked Oct 21 '22 by Jacopofar


1 Answer

Since the release of Hadoop 2.3 you can use HDFS in-memory caching (centralized cache management), which lets you explicitly pin HDFS paths into off-heap memory on the DataNodes.
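A minimal sketch of how that feature is typically enabled (the pool name, path and memory limit below are illustrative, not from the answer): first allow each DataNode to lock memory for caching in hdfs-site.xml, then pin the input directory with the hdfs cacheadmin CLI.

    <!-- hdfs-site.xml: off-heap memory (in bytes) each DataNode may lock for caching;
         must not exceed the DataNode user's memlock limit (ulimit -l) -->
    <property>
      <name>dfs.datanode.max.locked.memory</name>
      <value>68719476736</value> <!-- e.g. 64 GB -->
    </property>

    # create a cache pool and pin the input data set into DataNode memory
    hdfs cacheadmin -addPool reportpool
    hdfs cacheadmin -addDirective -path /data/input -pool reportpool -replication 1
    hdfs cacheadmin -listDirectives   # check how much of the path is actually cached

Blocks that fit in the cache are served from memory, and anything that doesn't fit simply falls back to normal disk reads, which is the "RAM first, disk only when necessary" behaviour asked about.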

Answered Oct 30 '22 by Santiago Cepas