I am getting a java.io.IOException: No space left on device after running a simple query in sparklyr. I am using the latest versions of both Spark (2.1.1) and sparklyr:
df_new <- spark_read_parquet(sc, "/mypath/parquet_*", name = "df_new", memory = FALSE)
myquery <- df_new %>% group_by(text) %>% summarize(mycount = n()) %>%
  arrange(desc(mycount)) %>% head(10)
# this FAILS
get_result <- collect(myquery)
I have set both
spark.local.dir <- "/mypath/"
spark.worker.dir <- "/mypath/"
using the usual
config <- spark_config()
config$`spark.executor.memory` <- "100GB"
config$`spark.executor.cores` <- "3"
config$`spark.local.dir` <- "/mypath/"
config$`spark.worker.dir` <- "/mypath/"
config$`spark.cores.max` <- "2000"
config$`spark.default.parallelism` <- "4"
config$`spark.total-executor-cores` <- "80"
config$`sparklyr.shell.driver-memory` <- "100G"
config$`sparklyr.shell.executor-memory` <- "100G"
config$`spark.yarn.executor.memoryOverhead` <- "100G"
config$`sparklyr.shell.num-executors` <- "90"
config$`spark.memory.fraction` <- "0.2"
Sys.setenv(SPARK_HOME="mysparkpath")
sc <- spark_connect(master = "spark://mynode", config = config)
where /mypath has more than 5 TB of disk space (I can see these options in the Environment tab). I tried a similar command in PySpark and it failed the same way (same error).
By looking at the Stages tab in the Spark UI, I see that the error occurs when the shuffle write reaches about 60 GB (the input is about 200 GB). This is puzzling given that I have plenty of space available. I have already looked at the other SO solutions...
The cluster job is started with Magpie: https://github.com/LLNL/magpie/blob/master/submission-scripts/script-sbatch-srun/magpie.sbatch-srun-spark
Every time I start a Spark job, I see a directory called spark-abcd-random_numbers in my /mypath folder, but the files in there are very small, about 700K in total, nowhere near the 60 GB shuffle write (the original CSV files were 100 GB). They essentially contain strings. What is the problem here? Thanks!
I've had this problem multiple times before. The cause is temporary files: most servers have a very small partition for /tmp/, which is Spark's default temporary directory.
Usually, I change that by setting it in the spark-submit command as follows:
$spark-submit --master local[*] --conf "spark.driver.extraJavaOptions=-Djava.io.tmpdir=/mypath/" ....
In your case, I think you can provide that in the R configuration as follows (I have not tested this, but it should work):
config$`spark.driver.extraJavaOptions` <- "-Djava.io.tmpdir=/mypath/"
config$`spark.executor.extraJavaOptions` <- "-Djava.io.tmpdir=/mypath/"
Note that you have to change this for both the driver and the executors, since you're using a Spark standalone master (as I can see in your question).
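For instance, here is a minimal, untested sketch that folds both options into the connection config (the master URL is the one from your question):
library(sparklyr)
config <- spark_config()
# redirect the driver and executor JVMs' temporary directory away from the small /tmp partition
config$`spark.driver.extraJavaOptions` <- "-Djava.io.tmpdir=/mypath/"
config$`spark.executor.extraJavaOptions` <- "-Djava.io.tmpdir=/mypath/"
sc <- spark_connect(master = "spark://mynode", config = config)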
I hope that helps.
Change the following settings in your Magpie script:
export MAGPIE_LOCAL_DIR="/tmp/${USER}/magpie"
export SPARK_LOCAL_DIR="/tmp/${USER}/spark"
so that they use the /mypath prefix rather than /tmp.
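Assuming /mypath is the large partition from the question, the edited lines would look like this:
export MAGPIE_LOCAL_DIR="/mypath/${USER}/magpie"
export SPARK_LOCAL_DIR="/mypath/${USER}/spark"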
Once you set the parameter, you can see the new value of spark.local.dir in the Spark environment UI, but it doesn't take effect right away. I faced a similar problem: after setting this parameter, I restarted the machines, and then it started working.
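If you want to double-check which value the running context actually uses, one untested sketch using sparklyr's invoke API (reading the setting back from the live SparkContext) is:
library(sparklyr)
# ask the running SparkContext for the effective spark.local.dir
sc %>% spark_context() %>% invoke("getConf") %>% invoke("get", "spark.local.dir")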
Since this has to be set when the JVM is launched via spark-submit, you need to use the sparklyr shell java options, e.g.
config$`sparklyr.shell.driver-java-options` <- "-Djava.io.tmpdir=/mypath"
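A minimal, untested sketch of how this fits into a connection like the one in the question (master URL reused from there):
config <- spark_config()
config$`sparklyr.shell.driver-java-options` <- "-Djava.io.tmpdir=/mypath"
sc <- spark_connect(master = "spark://mynode", config = config)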