Caching in RAM using HDFS

Question

I need to process some large files (~2 TB) on a small cluster (~10 servers) to produce a relatively small report (a few GB).

I only care about the final report, not the intermediate results, and the machines have plenty of RAM, so I would like to use it to reduce disk access as much as possible (and so increase speed), ideally by storing the data blocks in volatile memory and using the disk only when necessary.

Looking at the configuration files and a previous question, it seems Hadoop doesn't offer this feature. The Spark website mentions a MEMORY_AND_DISK option, but I'd prefer not to ask the company to deploy new software based on a new language.

The only "solution" I found is to set dfs.datanode.data.dir to /dev/shm/ in hdfs-default.xml, tricking HDFS into storing its data in volatile memory instead of on the filesystem. Even so, I assume it would behave badly once the RAM fills up and it starts using swap.
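A rough way to judge whether the /dev/shm approach could fit at all is to estimate how much data each node would have to hold. The sketch below is only back-of-the-envelope arithmetic: it assumes decimal units, an even spread of blocks across nodes, and a replication factor that you pass in (3 is the usual HDFS default for dfs.replication). The function name is made up for illustration.

```python
def per_node_footprint_gb(total_tb, nodes, replication):
    """Estimate the data held per node, in GB.

    Assumes blocks are spread evenly and 1 TB = 1000 GB.
    """
    total_gb = total_tb * 1000
    return total_gb * replication / nodes


if __name__ == "__main__":
    # Figures from the question: ~2 TB of input on ~10 servers.
    for replication in (1, 3):
        gb = per_node_footprint_gb(2, 10, replication)
        print(f"replication={replication}: about {gb:.0f} GB per node")
```

If the resulting figure is close to or above the RAM available on each server, the tmpfs mount will spill into swap, which is the failure mode described above.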

Is there a trick to make Hadoop store data blocks in RAM as much as possible and write to disk only when necessary?

Santiago Cepas · Accepted Answer

Since the release of Hadoop 2.3, you can use HDFS in-memory caching (centralized cache management).
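As a minimal sketch of how this might be driven, the helper below only builds the `hdfs cacheadmin` command lines for creating a cache pool and adding a cache directive; it does not run them. The pool name `report-pool` and the path `/data/input` are hypothetical placeholders, and the function name is invented for this example. It also assumes the DataNodes have `dfs.datanode.max.locked.memory` raised above its default of 0, since otherwise nothing can be cached.

```python
def cache_commands(path, pool, replication=1):
    """Build hdfs cacheadmin invocations as argument lists.

    path and pool are placeholders supplied by the caller;
    replication is the number of cached replicas requested.
    """
    return [
        # Create the pool that will own the directive.
        ["hdfs", "cacheadmin", "-addPool", pool],
        # Ask HDFS to keep the blocks under `path` cached in memory.
        ["hdfs", "cacheadmin", "-addDirective",
         "-path", path, "-pool", pool,
         "-replication", str(replication)],
        # List the directives in the pool to check the result.
        ["hdfs", "cacheadmin", "-listDirectives", "-pool", pool],
    ]


if __name__ == "__main__":
    for cmd in cache_commands("/data/input", "report-pool"):
        print(" ".join(cmd))
```

Note that this caches blocks that already exist on disk so that reads are served from memory; the data is still written to the DataNode disks first.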

Tags: caching, hadoop, hdfs

Jacopofar