Where does Spark store data when storage level is set to disk?

I was wondering which directory Spark stores data in when the storage level is set to DISK_ONLY or MEMORY_AND_DISK (in the latter case, the data that doesn't fit into memory). I ask because the level I set seems to make no difference: if the program crashes with MEMORY_ONLY, it also crashes with every other level.
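To make the setup concrete, here is a minimal sketch of this kind of persist call (the input path is illustrative, and the data is assumed to be larger than executor memory):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    object PersistSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("PersistSketch").getOrCreate()
        val sc = spark.sparkContext

        // Illustrative input; assume it does not fit in executor memory.
        val rdd = sc.textFile("hdfs:///some/large/input")

        // MEMORY_AND_DISK should spill partitions that don't fit in memory
        // to the executors' local scratch directories instead of failing.
        rdd.persist(StorageLevel.MEMORY_AND_DISK)
        println(rdd.count())

        spark.stop()
      }
    }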

In the cluster I'm using, the /tmp directory is a RAM disk and therefore limited in size. Is Spark trying to store the disk-level data on that drive? If so, maybe that is why I'm not seeing any difference. How can I change this default behavior? If I'm using a YARN cluster that comes with Hadoop, do I need to change the /tmp folder in the Hadoop configuration files, or would changing spark.local.dir in Spark be enough?
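(Outside of YARN, spark.local.dir can be set programmatically; a minimal sketch, where /data/spark-scratch is an assumed disk-backed path:)

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    // /data/spark-scratch is an assumed disk-backed scratch directory.
    // Note: on YARN this setting is ignored (see the answer below).
    val conf = new SparkConf().set("spark.local.dir", "/data/spark-scratch")
    val spark = SparkSession.builder.config(conf).getOrCreate()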

Asked Oct 19 '22 by MetallicPriest

1 Answer

Yes, Spark is trying to store the disk-level data on that drive.

In yarn-cluster mode, the local directories used by the Spark executors and the Spark driver will be the local directories configured for YARN (Hadoop YARN config yarn.nodemanager.local-dirs). If the user specifies spark.local.dir, it will be ignored.

Reference: https://spark.apache.org/docs/latest/running-on-yarn.html#important-notes

So to change Spark's local directory on YARN, change yarn.nodemanager.local-dirs in your YARN configuration (yarn-site.xml).
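A sketch of the relevant yarn-site.xml entry (the paths are placeholders; point them at disk-backed local directories on each node):

    <property>
      <name>yarn.nodemanager.local-dirs</name>
      <value>/data1/yarn/local,/data2/yarn/local</value>
    </property>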
