spark: java.io.IOException: No space left on device [again!]

I am getting java.io.IOException: No space left on device after running a simple query in sparklyr. I am using the latest versions of both Spark (2.1.1) and sparklyr.

df_new <- spark_read_parquet(sc, "/mypath/parquet_*", name = "df_new", memory = FALSE)

myquery <- df_new %>% group_by(text) %>% summarize(mycount = n()) %>% 
  arrange(desc(mycount)) %>% head(10)

#this FAILS
get_result <- collect(myquery)

I have set both

  • spark.local.dir <- "/mypath/"
  • spark.worker.dir <- "/mypath/"

using the usual

config <- spark_config()

config$`spark.executor.memory` <- "100GB"
config$`spark.executor.cores` <- "3"
config$`spark.local.dir` <- "/mypath/"
config$`spark.worker.dir` <- "/mypath/"
config$`spark.cores.max`<- "2000"
config$`spark.default.parallelism`<- "4"
config$`spark.total-executor-cores`<- "80"
config$`sparklyr.shell.driver-memory` <- "100G"
config$`sparklyr.shell.executor-memory` <- "100G"
config$`spark.yarn.executor.memoryOverhead` <- "100G"
config$`sparklyr.shell.num-executors` <- "90"
config$`spark.memory.fraction` <- "0.2"

Sys.setenv(SPARK_HOME="mysparkpath")
sc <- spark_connect(master = "spark://mynode", config = config)

where /mypath has more than 5TB of disk space (I can see these options in the Environment tab). I tried a similar command in PySpark and it failed the same way (same error).

Looking at the Stages tab in the Spark UI, I see that the error occurs when the shuffle write reaches about 60GB (the input is about 200GB). This is puzzling given that I have plenty of space available. I have already looked at the other SO solutions...

The cluster job is started with Magpie: https://github.com/LLNL/magpie/blob/master/submission-scripts/script-sbatch-srun/magpie.sbatch-srun-spark

Every time I start a Spark job, I see a directory called spark-abcd-random_numbers in my /mypath folder, but the files in there are very small (nowhere near the 60GB shuffle write).

  • There are about 40 parquet files, each about 700K (the original CSV files were 100GB). They essentially contain strings.
  • The cluster has 10 nodes, each with 120GB RAM and 20 cores.

What is the problem here? Thanks!!

ℕʘʘḆḽḘ asked Jul 03 '17


4 Answers

I've had this problem multiple times before. The cause is temporary files: most servers have a very small partition for /tmp/, which is Spark's default temporary directory.
I usually change this by setting it in the spark-submit command as follows:

$spark-submit --master local[*] --conf "spark.driver.extraJavaOptions=-Djava.io.tmpdir=/mypath/" ....

In your case, I think you can provide this through the configuration in R as follows (I have not tested it, but it should work):

config$`spark.driver.extraJavaOptions` <- "-Djava.io.tmpdir=/mypath/"
config$`spark.executor.extraJavaOptions` <- "-Djava.io.tmpdir=/mypath/"

Note that you have to change this for both the driver and the executors, since you are using a Spark standalone master (as I can see in your question).
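
Putting it together, a minimal sketch of the connection (untested; the master URL and path are just the placeholders from your question):

library(sparklyr)

config <- spark_config()
# point Java's temporary directory at the large volume for both driver and executors
config$`spark.driver.extraJavaOptions` <- "-Djava.io.tmpdir=/mypath/"
config$`spark.executor.extraJavaOptions` <- "-Djava.io.tmpdir=/mypath/"

# /mypath/ must exist and be writable on every node
sc <- spark_connect(master = "spark://mynode", config = config)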

I hope that helps.

user1314742 answered Nov 17 '22


Change the following settings in your magpie script

export MAGPIE_LOCAL_DIR="/tmp/${USER}/magpie" 
export SPARK_LOCAL_DIR="/tmp/${USER}/spark"

so that they use a /mypath prefix (for example, /mypath/${USER}/spark) instead of /tmp.

Igor Berman answered Nov 17 '22


Once you set the parameter, you can see the new value of spark.local.dir in the Spark environment UI, but it doesn't actually take effect.

I faced a similar problem. After setting this parameter, I restarted the machines, and then it started working.

Santhosh Tangudu answered Nov 17 '22


Since this needs to be set when the JVM is launched via spark-submit, you need to use sparklyr's shell java-options, e.g.

config$`sparklyr.shell.driver-java-options` <- "-Djava.io.tmpdir=/mypath"
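
A minimal sketch (untested) of how this might look in a full connection; the executor side is set through spark.executor.extraJavaOptions as in the first answer, since spark-submit has no separate executor java-options flag:

library(sparklyr)

config <- spark_config()
# passed to spark-submit as --driver-java-options, so it applies when the driver JVM starts
config$`sparklyr.shell.driver-java-options` <- "-Djava.io.tmpdir=/mypath"
# executors pick up the same setting through the Spark conf
config$`spark.executor.extraJavaOptions` <- "-Djava.io.tmpdir=/mypath"

sc <- spark_connect(master = "spark://mynode", config = config)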

kevinykuo answered Nov 17 '22