How to overcome Spark "No Space left on the device" error in AWS Glue Job

I used an AWS Glue job with PySpark to read data from S3 parquet files totaling more than 10 TB, but the job kept failing during execution of a Spark SQL query with the error:

java.io.IOException: No space left on the device

On analysis, I found that AWS Glue G.1X workers have 4 vCPUs, 16 GB of memory, and 64 GB of disk, so we tried increasing the number of workers.

Even after increasing the number of Glue (G.1X) workers to 50, the job keeps failing with the same error.

Is there a way to configure the Spark local temp directory to point to S3 instead of the local filesystem? Or can we mount an EBS volume on the Glue workers?

I tried configuring the property in the SparkSession builder, but Spark still uses the local tmp directory:

SparkSession.builder.appName("app").config("spark.local.dir", "s3a://s3bucket/temp").getOrCreate()
Vigneshwaran asked Dec 28 '20

2 Answers

As @Prajappati stated, there are several solutions.

These solutions are described in detail in the AWS blog post that introduces the S3 shuffle feature. I am going to omit shuffle-configuration tweaking, since it is not very reliable. So, basically, you can either:

  • Scale up vertically, increasing the size of the machine (i.e. going from G.1X to G.2X), which increases the cost.

  • Disaggregate compute and storage, which in this case means using S3 as the storage service for spills and shuffles.

    At the time of writing, to configure this disaggregation, the job must be configured with the following settings:

    • Glue 2.0 engine
    • Glue job parameters:

      Parameter                    | Value | Explanation
      --write-shuffle-files-to-s3  | true  | Main parameter (required)
      --write-shuffle-spills-to-s3 | true  | Optional
      --conf                       | spark.shuffle.glue.s3ShuffleBucket=s3://<your-bucket-name>/<your-path> | Optional. If not set, the path --TempDir/shuffle-data is used instead

    Remember to assign the proper IAM permissions to the job so it can access the bucket and write under the S3 path provided or configured by default.
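If the job is managed from code rather than the console, these parameters can be merged into the job's DefaultArguments. Here is a minimal Python sketch; the job name, bucket, and path are hypothetical placeholders, and the actual boto3 call is shown commented out since it requires AWS credentials:

```python
import json

# Default arguments mirroring the parameter table above.
# "my-shuffle-bucket" is a placeholder, not a real bucket.
shuffle_args = {
    "--write-shuffle-files-to-s3": "true",   # main parameter (required)
    "--write-shuffle-spills-to-s3": "true",  # optional: also write spills to S3
    "--conf": "spark.shuffle.glue.s3ShuffleBucket=s3://my-shuffle-bucket/shuffle-data",
}

def with_shuffle_args(existing_args):
    """Merge the S3-shuffle parameters into a job's existing DefaultArguments."""
    merged = dict(existing_args)
    merged.update(shuffle_args)
    return merged

if __name__ == "__main__":
    # With boto3 and credentials in place, the update would look roughly like:
    #   glue = boto3.client("glue")
    #   job = glue.get_job(JobName="my-glue-job")["Job"]
    #   glue.update_job(JobName="my-glue-job", JobUpdate={
    #       "Role": job["Role"],
    #       "Command": job["Command"],
    #       "DefaultArguments": with_shuffle_args(job.get("DefaultArguments", {})),
    #   })
    print(json.dumps(with_shuffle_args({"--TempDir": "s3://my-shuffle-bucket/tmp"}), indent=2))
```

Merging into the existing DefaultArguments (rather than replacing them) preserves parameters such as --TempDir that the job already relies on.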

MarcosBernal answered Oct 12 '22


According to the error message, it appears the Glue job is running out of disk space when writing a DynamicFrame. As you may know, Spark performs a shuffle on certain operations, writing intermediate results to local disk. When the shuffle is too large for the available disk, the job fails with this error.

There are 2 options to consider.

  1. Upgrade your worker type to G.2X and/or increase the number of workers.

  2. Implement the AWS Glue Spark shuffle manager with S3 [1]. To use this option, the job must run on Glue version 2.0. The Glue Spark shuffle manager writes the shuffle files and shuffle spills to S3, lowering the probability of the job running out of local disk and failing. Add the following job parameters via these steps:

  1. Open the "Jobs" tab in the Glue console.
  2. Select the job you want to apply this to, click "Actions", then click "Edit Job".
  3. Scroll down and open the drop-down named "Security configuration, script libraries, and job parameters (optional)".
  4. Under job parameters, enter the following key/value pairs:
     • Key: --write-shuffle-files-to-s3, Value: true
     • Key: --write-shuffle-spills-to-s3, Value: true
     • Key: --conf, Value: spark.shuffle.glue.s3ShuffleBucket=s3://<your-bucket-name>

Remember to replace the angle brackets <> with the name of the S3 bucket where you would like to store the shuffle data. Then click "Save" and run the job.
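The steps above edit the saved job, but the same parameters can also be supplied for a single run, e.g. through boto3's start_job_run. A sketch under the assumption that "my-glue-job" and "my-shuffle-bucket" are placeholders; the AWS call itself is commented out because it needs boto3 and credentials:

```python
# Sketch: supplying the shuffle parameters per run instead of saving them
# on the job. Job name and bucket below are hypothetical placeholders.
run_arguments = {
    "--write-shuffle-files-to-s3": "true",   # required for the S3 shuffle manager
    "--write-shuffle-spills-to-s3": "true",  # optional: also spill to S3
    "--conf": "spark.shuffle.glue.s3ShuffleBucket=s3://my-shuffle-bucket/shuffle-data",
}

# With credentials configured, the run would be started like:
#   import boto3
#   glue = boto3.client("glue")
#   glue.start_job_run(JobName="my-glue-job", Arguments=run_arguments)

for key, value in sorted(run_arguments.items()):
    print(f"{key} = {value}")
```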

Prajapati Mehul answered Oct 12 '22