I used an AWS Glue job with PySpark to read more than 10 TB of Parquet files from S3, but the job kept failing during the execution of a Spark SQL query with the error:
```
java.io.IOException: No space left on device
```
On analysis, I found that AWS Glue G.1X workers have 4 vCPUs, 16 GB of memory, and 64 GB of disk, so we tried increasing the number of workers.
Even after increasing the number of G.1X workers to 50, which still gives only about 50 × 64 GB ≈ 3.2 TB of aggregate local disk for a 10+ TB dataset, the job keeps failing with the same error.
Is there a way to configure the Spark local temp directory to point to S3 instead of the local filesystem? Or can we mount an EBS volume on the Glue workers?
I tried configuring the property in the SparkSession builder, but Spark is still using the local temp directory:

```python
from pyspark.sql import SparkSession

SparkSession.builder.appName("app").config("spark.local.dir", "s3a://s3bucket/temp").getOrCreate()
```
As @Prajappati stated, there are several solutions.
These solutions are described in detail in the AWS blog post that introduces the S3 shuffle feature. I am going to omit the shuffle-configuration tweaking, since it is not very reliable. So, basically, you can either:
- Scale up vertically, increasing the size of the machine (i.e. going from G.1X to G.2X), which increases the cost.
- Disaggregate compute and storage, which in this case means using S3 as the storage service for spills and shuffles.
At the time of writing, to configure this disaggregation, the job must be configured with the following settings:
| Parameter | Value | Explanation |
|---|---|---|
| `--write-shuffle-files-to-s3` | `true` | Main parameter (required) |
| `--write-shuffle-spills-to-s3` | `true` | Optional |
| `--conf` | `spark.shuffle.glue.s3ShuffleBucket=s3://<your-bucket-name>/<your-path>` | Optional. If not set, the path `--TempDir`/shuffle-data is used instead |
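These settings can also be supplied as default arguments when the job is defined. Below is a minimal sketch using boto3's `create_job`; the job name, role, script location, and bucket paths are placeholders, not values from the question:

```python
import boto3

glue = boto3.client("glue")

# All names and paths below are hypothetical; replace them with your own.
glue.create_job(
    Name="parquet-10tb-job",
    Role="MyGlueJobRole",
    GlueVersion="2.0",  # the S3 shuffle feature discussed above targets Glue 2.0
    WorkerType="G.1X",
    NumberOfWorkers=50,
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://s3bucket/scripts/job.py",
        "PythonVersion": "3",
    },
    DefaultArguments={
        "--TempDir": "s3://s3bucket/temp/",
        # Shuffle disaggregation settings from the table above:
        "--write-shuffle-files-to-s3": "true",
        "--write-shuffle-spills-to-s3": "true",
        "--conf": "spark.shuffle.glue.s3ShuffleBucket=s3://<your-bucket-name>/<your-path>",
    },
)
```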
Remember to assign the proper IAM permissions to the job so it can access the bucket and write under the S3 path provided (or the one configured by default).
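As an illustration, a least-privilege policy along these lines could be attached to the job role. This is a sketch with hypothetical role and policy names; the actions shown (list the bucket, read/write/delete objects under the shuffle prefix) reflect what writing shuffle data to S3 requires:

```python
import json
import boto3

iam = boto3.client("iam")

# Hypothetical names; scope the resources to your shuffle bucket and path.
shuffle_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": "arn:aws:s3:::<your-bucket-name>",
        },
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
            "Resource": "arn:aws:s3:::<your-bucket-name>/<your-path>/*",
        },
    ],
}

iam.put_role_policy(
    RoleName="MyGlueJobRole",
    PolicyName="glue-s3-shuffle-access",
    PolicyDocument=json.dumps(shuffle_policy),
)
```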
According to the error message, it appears the Glue job is running out of local disk space when writing a DynamicFrame. As you may know, Spark performs a shuffle on certain operations and writes the intermediate results to disk; when the shuffle is too large, the job fails with the error above.
There are two options to consider:
1. Upgrade your worker type to G.2X and/or increase the number of workers.
2. Implement the AWS Glue Spark shuffle manager with S3 [1]. To implement this option, you will need to downgrade to Glue version 2.0. The Glue Spark shuffle manager writes the shuffle files and shuffle spills to S3, lowering the probability of your job running out of local disk space and failing. Add the following additional job parameters to the job configuration:
   - `--write-shuffle-files-to-s3` : `true`
   - `--write-shuffle-spills-to-s3` : `true`
   - `--conf` : `spark.shuffle.glue.s3ShuffleBucket=s3://<your-bucket-name>/<your-path>`
Remember to replace the angle brackets <> with the name of the S3 bucket where you would like to store the shuffle data, then save and run the job.
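If you would rather apply these parameters programmatically than through the console, a minimal boto3 sketch follows. The job name is hypothetical, and note that `update_job` replaces the entire job definition, so the existing definition is fetched and merged first:

```python
import boto3

glue = boto3.client("glue")

JOB_NAME = "my-glue-job"  # hypothetical job name

# update_job resets any unspecified configuration, so fetch the current
# definition and merge the new arguments into it instead of overwriting.
job = glue.get_job(JobName=JOB_NAME)["Job"]

args = job.get("DefaultArguments", {})
args.update({
    "--write-shuffle-files-to-s3": "true",
    "--write-shuffle-spills-to-s3": "true",
    "--conf": "spark.shuffle.glue.s3ShuffleBucket=s3://<your-bucket-name>/<your-path>",
})

glue.update_job(
    JobName=JOB_NAME,
    JobUpdate={
        "Role": job["Role"],
        "Command": job["Command"],
        "GlueVersion": "2.0",   # the S3 shuffle manager option requires Glue 2.0
        "WorkerType": "G.2X",   # option 1: larger workers; omit to keep the current type
        "NumberOfWorkers": 50,
        "DefaultArguments": args,
    },
)
```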