I am using persist() with different storage levels, but I found no performance difference between MEMORY_ONLY and DISK_ONLY.
I think there might be something wrong with my code... Where can I find the persisted RDDs on disk so that I can make sure they were actually persisted?
RDDs are the main logical data units in Spark. They are distributed collections of objects, stored in memory or on disk across the machines of a cluster. A single RDD is divided into multiple logical partitions, so that those partitions can be stored and processed on different machines of the cluster.
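For instance (a minimal sketch assuming a local Spark setup; the app name is illustrative), you can control the number of partitions when creating an RDD:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch: build an RDD split into 4 partitions, so its
// pieces can be stored and processed on different executors.
val conf = new SparkConf().setAppName("rdd-partitions").setMaster("local[*]")
val sc = new SparkContext(conf)

val rdd = sc.parallelize(1 to 1000, numSlices = 4)
println(rdd.getNumPartitions) // 4
```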
With the DISK_ONLY storage level, the RDD is stored only on disk: the space used for storage is low, but the CPU/computation time is high, since every access has to read the partitions back from disk.
RDDs keep data in memory for fast access during computation and provide fault tolerance [110]. An RDD is an immutable distributed collection of objects, stored across the nodes of the cluster, that can be operated on in parallel.
To persist an RDD, we call its persist() method. Spark can be used from Scala, Python, Java, and other languages. Called with no argument, persist() uses the default MEMORY_ONLY level, which, when working with Java and Scala, stores the data in the JVM as deserialized objects.
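As a sketch (reusing the sc from above), here is how the two levels from the question differ in code, and one way to check that persistence actually happened:

```scala
import org.apache.spark.storage.StorageLevel

// persist() with no argument is equivalent to persist(StorageLevel.MEMORY_ONLY)
// for RDDs: deserialized Java objects in the JVM heap.
val inMemory = sc.parallelize(1 to 1000000).map(_ * 2).persist(StorageLevel.MEMORY_ONLY)
val onDisk   = sc.parallelize(1 to 1000000).map(_ * 2).persist(StorageLevel.DISK_ONLY)

// Persistence is lazy: nothing is cached until an action runs.
inMemory.count() // materializes the in-memory cache
onDisk.count()   // writes block files under spark.local.dir

// The Storage tab of the Spark UI (http://localhost:4040 while the app runs)
// lists each persisted RDD and how much of it sits in memory vs. on disk.
```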
As per the docs:

spark.local.dir (default: /tmp)

Directory to use for "scratch" space in Spark, including map output files and RDDs that get stored on disk. This should be on a fast, local disk in your system. It can also be a comma-separated list of multiple directories on different disks. NOTE: In Spark 1.0 and later this will be overridden by SPARK_LOCAL_DIRS (Standalone, Mesos) or LOCAL_DIRS (YARN) environment variables set by the cluster manager.
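So the block files for a DISK_ONLY RDD end up under spark.local.dir (or whatever directory the cluster manager substitutes for it). A sketch for checking this locally, assuming a local[*] run; the scratch path is illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Point the scratch space at a directory we can inspect afterwards.
val conf = new SparkConf()
  .setAppName("inspect-disk-store")
  .setMaster("local[*]")
  .set("spark.local.dir", "/tmp/spark-scratch") // illustrative path
val sc = new SparkContext(conf)

sc.parallelize(1 to 1000000).persist(StorageLevel.DISK_ONLY).count()

// While the application is alive, the persisted partitions appear as
// files named after their block IDs, roughly:
//   /tmp/spark-scratch/blockmgr-<uuid>/<subdir>/rdd_<rddId>_<partition>
// e.g. `find /tmp/spark-scratch -name 'rdd_*'` should list them.
// They are cleaned up when the SparkContext shuts down.
```

If those files never show up, the usual culprit is that no action was ever run on the persisted RDD, so nothing was materialized in the first place.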