I am getting java.io.IOException: No space left on device after running a simple query in sparklyr. I am using the latest versions of both Spark (2.1.1) and sparklyr:
library(sparklyr)
library(dplyr)
df_new <- spark_read_parquet(sc, "/mypath/parquet_*", name = "df_new", memory = FALSE)
myquery <- df_new %>% group_by(text) %>% summarize(mycount = n()) %>%
  arrange(desc(mycount)) %>% head(10)
# this FAILS
get_result <- collect(myquery)
I have set both
spark.local.dir <- "/mypath/"
spark.worker.dir <- "/mypath/"
using the usual
config <- spark_config()
config$`spark.executor.memory` <- "100GB"
config$`spark.executor.cores` <- "3"
config$`spark.local.dir` <- "/mypath/"
config$`spark.worker.dir` <- "/mypath/"
config$`spark.cores.max`<- "2000"
config$`spark.default.parallelism`<- "4"
config$`spark.total-executor-cores`<- "80"
config$`sparklyr.shell.driver-memory` <- "100G"
config$`sparklyr.shell.executor-memory` <- "100G"
config$`spark.yarn.executor.memoryOverhead` <- "100G"
config$`sparklyr.shell.num-executors` <- "90"
config$`spark.memory.fraction` <- "0.2"
Sys.setenv(SPARK_HOME="mysparkpath")
sc <- spark_connect(master = "spark://mynode", config = config)
where /mypath has more than 5 TB of disk space (and I can see these options in the Environment tab). I tried a similar command in PySpark and it failed the same way (same error).
Looking at the Stages tab in the Spark UI, I see that the error occurs when the shuffle write reaches about 60 GB (the input is about 200 GB). This is puzzling given that I have plenty of space available. I have already looked at the other SO solutions...
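For reference, here is a minimal, untested sketch of reading the live value of spark.local.dir back from the running context via sparklyr's invoke() interface, to double-check what the executors are actually using (the "/tmp" passed as the last argument is just the fallback returned when the key is unset):
library(sparklyr)
# query the running SparkContext's SparkConf for spark.local.dir
local_dir <- spark_context(sc) %>%
  invoke("getConf") %>%
  invoke("get", "spark.local.dir", "/tmp")
print(local_dir)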
The cluster job is started with magpie https://github.com/LLNL/magpie/blob/master/submission-scripts/script-sbatch-srun/magpie.sbatch-srun-spark
Every time I start a Spark job, I see a directory called spark-abcd-random_numbers in my /mypath folder, but the total size of the files in there is very small, about 700K (nowhere near the 60 GB shuffle write; the original csv files were 100 GB). They essentially contain strings. What is the problem here? Thanks!
I've had this problem multiple times before. The cause is the temporary files: most servers have a very small partition for /tmp/, which is Spark's default temporary directory. Usually, I change that by setting it in the spark-submit command as follows:
$spark-submit --master local[*] --conf "spark.driver.extraJavaOptions=-Djava.io.tmpdir=/mypath/" ....
In your case, I think you can provide this through the configuration in R as follows (I have not tested it, but it should work):
config$`spark.driver.extraJavaOptions` <- "-Djava.io.tmpdir=/mypath/"
config$`spark.executor.extraJavaOptions` <- "-Djava.io.tmpdir=/mypath/"
Notice that you have to change this for both the driver and the executors, since you're using a Spark standalone master (as I can see in your question). I hope that helps.
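An untested sketch of how these two settings could be folded into the config and connection code from the question (the master URL and /mypath are taken from there):
library(sparklyr)
config <- spark_config()
# point both the driver and the executor JVMs at the large partition
config$`spark.driver.extraJavaOptions` <- "-Djava.io.tmpdir=/mypath/"
config$`spark.executor.extraJavaOptions` <- "-Djava.io.tmpdir=/mypath/"
sc <- spark_connect(master = "spark://mynode", config = config)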
Change the following settings in your magpie script
export MAGPIE_LOCAL_DIR="/tmp/${USER}/magpie"
export SPARK_LOCAL_DIR="/tmp/${USER}/spark"
so that they use a /mypath prefix instead of /tmp, as shown below.
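For example, assuming /mypath exists and is writable on every node, the two lines would become something like:
export MAGPIE_LOCAL_DIR="/mypath/${USER}/magpie"
export SPARK_LOCAL_DIR="/mypath/${USER}/spark"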
Once you set the parameter, you can see the new value of spark.local.dir in the Spark Environment UI, but it doesn't take effect right away. I faced a similar problem: after setting this parameter, I restarted the machines and then it started working.
Since this has to be set when the JVM is launched via spark-submit, you need to use the sparklyr java-options, e.g.
config$`sparklyr.shell.driver-java-options` <- "-Djava.io.tmpdir=/mypath"
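A minimal, untested sketch of how this fits into the connection code from the question:
library(sparklyr)
config <- spark_config()
# passed to spark-submit as --driver-java-options when the JVM is launched
config$`sparklyr.shell.driver-java-options` <- "-Djava.io.tmpdir=/mypath"
sc <- spark_connect(master = "spark://mynode", config = config)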