AWS Glue - Writing File Takes A Very Long Time

Hi, I have an ETL job in AWS Glue that takes a very long time to write. It reads data from S3 and performs a few transformations (not all are listed below, but the transformations do not seem to be the issue) and then finally writes the data frame to S3. This write takes approximately 30 minutes for a file of about 20 MB, even with 10 workers (worker type G.1X). I have used print statements to see what takes time, and it seems to be the final operation of writing the file to S3. I have not had this issue before with the same kind of setup.

I'm using Glue version 3.0, Python version 3, and Spark version 3.1.

The source consists of almost 50,000 files spread out over many folders, and new files are generated automatically every day. The average file size is about 10 KB.

Any suggestions on this issue?

#Glue context & spark session
import sys

from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark import SparkConf
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import lag

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

#Works around the issues with old datetimes in the new version of Spark
spark_conf = SparkConf()
spark_conf.setAll([
    ('spark.sql.legacy.parquet.int96RebaseModeInRead', 'CORRECTED'),
    ('spark.sql.legacy.parquet.int96RebaseModeInWrite', 'CORRECTED'),
    ('spark.sql.legacy.parquet.datetimeRebaseModeInRead', 'CORRECTED'),
    ('spark.sql.legacy.parquet.datetimeRebaseModeInWrite', 'CORRECTED')
])

session = SparkSession.builder.config(conf=spark_conf).enableHiveSupport().getOrCreate()
glueContext = GlueContext(session.sparkContext)
spark = glueContext.spark_session

#Source(/s) - create dynamic frame
dy = glueContext.create_dynamic_frame.from_options(
    format_options={"multiline": False},
    connection_type="s3",
    format="json",
    connection_options={
        "paths": [
            "s3://.../files/abc/"
        ],
        "recurse": True,
        "groupFiles": "inPartition"
    },
    transformation_ctx="dy",
)

#Convert the dynamic frame to a Spark DataFrame
df = dy.toDF()

#Transformation(/s)
#For each row, add the previous timestamp within its ID group
df_ready = df \
    .sort(['ID', 'timestamp'], ascending=False) \
    .withColumn("timestamp_prev",
                lag(df.timestamp)
                .over(Window.partitionBy("ID").orderBy("timestamp")))

#Target - write the result to S3 as a single parquet file
df_ready.repartition(1).write.mode('overwrite').parquet("s3a://.../thisismywritefolder/df_ready/")
Asked Nov 22 '25 18:11 by Qwaz


1 Answer

You are repartitioning to a single partition at the end (repartition(1)), which funnels all the data through one task and prevents Glue from writing in parallel. If you remove the repartition, you should see much faster writes.
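A minimal sketch of the write step with the repartition removed (the truncated S3 path is copied from the question, and coalesce(10) is an illustrative partition count, not a recommendation):

#Each partition is written as its own part file, in parallel
df_ready.write.mode('overwrite').parquet("s3a://.../thisismywritefolder/df_ready/")

#If fewer output files are wanted, coalesce to a small number of
#partitions instead of forcing everything through one task:
#df_ready.coalesce(10).write.mode('overwrite').parquet("s3a://.../thisismywritefolder/df_ready/")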

Answered Nov 25 '25 00:11 by Robert Kossendey


