spark: java.io.IOException: No space left on device [again!]

I am getting java.io.IOException: No space left on device after running a simple query in sparklyr. I am using the latest versions of both Spark (2.1.1) and sparklyr.

df_new <- spark_read_parquet(sc, "/mypath/parquet_*", name = "df_new", memory = FALSE)

myquery <- df_new %>% group_by(text) %>% summarize(mycount = n()) %>% 
  arrange(desc(mycount)) %>% head(10)

#this FAILS
get_result <- collect(myquery)

I have set both

  • spark.local.dir <- "/mypath/"
  • spark.worker.dir <- "/mypath/"

using the usual

config <- spark_config()

config$`spark.executor.memory` <- "100GB"
config$`spark.executor.cores` <- "3"
config$`spark.local.dir` <- "/mypath/"
config$`spark.worker.dir` <- "/mypath/"
config$`spark.cores.max`<- "2000"
config$`spark.default.parallelism`<- "4"
config$`spark.total-executor-cores`<- "80"
config$`sparklyr.shell.driver-memory` <- "100G"
config$`sparklyr.shell.executor-memory` <- "100G"
config$`spark.yarn.executor.memoryOverhead` <- "100G"
config$`sparklyr.shell.num-executors` <- "90"
config$`spark.memory.fraction` <- "0.2"

Sys.setenv(SPARK_HOME="mysparkpath")
sc <- spark_connect(master = "spark://mynode", config = config)

where /mypath has more than 5TB of disk space (I can see these options in the Environment tab). I tried a similar command in PySpark and it failed the same way (same error).

Looking at the Stages tab in the Spark UI, I see that the error occurs when the shuffle write reaches about 60GB (the input is about 200GB). This is puzzling given that I have plenty of space available. I have already looked at the other SO solutions...

The cluster job is started with Magpie: https://github.com/LLNL/magpie/blob/master/submission-scripts/script-sbatch-srun/magpie.sbatch-srun-spark

Every time I start a Spark job, I see a directory called spark-abcd-random_numbers in my /mypath folder, but the files in there are very small (nowhere near the 60GB shuffle write).

  • There are about 40 parquet files, each about 700K (the original CSV files were 100GB). They essentially contain strings.
  • The cluster has 10 nodes, each with 120GB RAM and 20 cores.

What is the problem here? Thanks!!

ℕʘʘḆḽḘ asked Jul 03 '17


4 Answers

I've had this problem multiple times before. The cause is temporary files: most servers have a very small partition for /tmp/, which is Spark's default temporary directory.
I usually change this by setting it in the spark-submit command as follows:

$spark-submit --master local[*] --conf "spark.driver.extraJavaOptions=-Djava.io.tmpdir=/mypath/" ....

In your case, I think you can provide this through the configuration in R as follows (I have not tested it, but it should work):

config$`spark.driver.extraJavaOptions` <- "-Djava.io.tmpdir=/mypath/"
config$`spark.executor.extraJavaOptions` <- "-Djava.io.tmpdir=/mypath/"

Note that you have to change this for both the driver and the executors, since you are using a Spark standalone master (as I can see in your question).
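
Putting it together, a minimal sketch of the connection (untested; the master URL and path are just the placeholders from your question):

library(sparklyr)

config <- spark_config()
# point Java's temporary directory at the large volume for both driver and executors
config$`spark.driver.extraJavaOptions` <- "-Djava.io.tmpdir=/mypath/"
config$`spark.executor.extraJavaOptions` <- "-Djava.io.tmpdir=/mypath/"

# /mypath/ must exist and be writable on every node
sc <- spark_connect(master = "spark://mynode", config = config)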

I hope that helps.

user1314742 answered Nov 17 '22


Change the following settings in your magpie script

export MAGPIE_LOCAL_DIR="/tmp/${USER}/magpie" 
export SPARK_LOCAL_DIR="/tmp/${USER}/spark"

so that they use a /mypath prefix (for example, /mypath/${USER}/spark) instead of /tmp.

Igor Berman answered Nov 17 '22


Once you set the parameter, you can see the new value of spark.local.dir in the Spark environment UI, but it doesn't actually take effect.

I faced a similar problem. After setting this parameter, I restarted the machines, and then it started working.

Santhosh Tangudu answered Nov 17 '22


Since this needs to be set when the JVM is launched via spark-submit, you need to use sparklyr's shell java-options, e.g.

config$`sparklyr.shell.driver-java-options` <- "-Djava.io.tmpdir=/mypath"
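
A minimal sketch (untested) of how this might look in a full connection; the executor side is set through spark.executor.extraJavaOptions as in the first answer, since spark-submit has no separate executor java-options flag:

library(sparklyr)

config <- spark_config()
# passed to spark-submit as --driver-java-options, so it applies when the driver JVM starts
config$`sparklyr.shell.driver-java-options` <- "-Djava.io.tmpdir=/mypath"
# executors pick up the same setting through the Spark conf
config$`spark.executor.extraJavaOptions` <- "-Djava.io.tmpdir=/mypath"

sc <- spark_connect(master = "spark://mynode", config = config)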

kevinykuo answered Nov 17 '22