I need to process some big files (~2 TBs) with a small cluster (~10 servers), in order to produce a relatively small report (some GBs).
I only care about the final report, not intermediate results, and the machines have a great amount of RAM, so it would be fantastic to use it to reduce as much as possible disk access (and consequently increasing speed), ideally by storing the data blocks in volatile memory using the disk only when.
Looking at the configuration files and a previous question it seems Hadoop doesn't offer this function. Spark website talks about a memory_and_disk option, but I'd prefer to ask the company to deploy a new software based on a new language.
The only "solution" I found is to set dfs.datanode.data.dir
as /dev/shm/
in hdfs-default.xml, to trick it to use volatile memory instead of the filesystem to store data, still in this case it would behave badly, I assume, when the RAM gets full and it uses the swap.
Is there a trick to make Hadoop store datablocks as much as possible on RAM and write on disk only when necessary?
Since the release of Hadoop 2.3 you can use HDFS in memory caching.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With