I just ran a very simple job as follows:
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Load the table from the Glue Data Catalog
l_table = glueContext.create_dynamic_frame.from_catalog(
    database="gluecatalog",
    table_name="fctable")

# Drop the partition columns and rename a field
l_table = l_table.drop_fields(['seq', 'partition_0', 'partition_1', 'partition_2', 'partition_3']) \
                 .rename_field('tbl_code', 'table_code')

print("Count:", l_table.count())
l_table.printSchema()
l_table.select_fields(['trans_time']).toDF().distinct().show()

# Flatten the nested "table" array into separate frames
dfc = l_table.relationalize("table_root", "s3://my-bucket/temp/")
print("Before keys() call")
dfc.keys()
print("After keys() call")

l_table.select_fields('table').printSchema()
dfc.select('table_root_table').toDF().where("id = 1 or id = 2").orderBy(['id', 'index']).show()
dfc.select('table_root').toDF().where("table = 1 or table = 2").show()
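As a side note (not from the original script), one simple way to see every frame that the relationalize() call produces is to iterate over the collection's keys; the frame names are derived from the "table_root" prefix passed above. A minimal sketch:

# Sketch: list each relationalized frame and print its schema.
# "dfc" is the DynamicFrameCollection returned by relationalize() above.
for key in dfc.keys():
    print("Frame:", key)
    dfc.select(key).printSchema()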
The data structure is simple too:
root
|-- table: array
| |-- element: struct
| | |-- trans_time: string
| | |-- seq: null
| | |-- operation: string
| | |-- order_date: string
| | |-- order_code: string
| | |-- tbl_code: string
| | |-- ship_plant_code: string
|-- partition_0
|-- partition_1
|-- partition_2
|-- partition_3
When I ran the job, it took anywhere from 12 to 16 minutes to finish, but the CloudWatch log showed that the job took only 2 seconds to display all my data.
So my questions are: where does an AWS Glue job spend its time beyond what the logging shows, and what is it doing outside the logged period?
It takes more time to crawl a large number of small files than a small number of large files. That's because the crawler must list each file and must read the first megabyte of each new file.
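If small-file overhead is a concern at read time as well, Glue can group many small S3 files into larger read tasks. A minimal sketch, assuming a direct S3 read with from_options (the bucket path, format, and group size below are illustrative):

# Hypothetical example: group small files so Spark reads fewer, larger tasks.
datasource = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://my-bucket/input/"],   # illustrative path
        "groupFiles": "inPartition",          # group files within each S3 partition
        "groupSize": "134217728"              # target roughly 128 MB per group (bytes)
    },
    format="json")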
Performance of AWS Glue 3.0
AWS Glue 3.0 speeds up your Spark applications in addition to offering reduced startup latencies. The following benchmark shows the performance improvements between AWS Glue 3.0 and AWS Glue 2.0 for a popular customer workload to convert large datasets from CSV to Apache Parquet format.
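For context, that benchmark workload (CSV in, Parquet out) corresponds to a very short Glue script. A minimal sketch, assuming a catalog table named "csv_table" and an output path that are both made up for illustration:

# Hypothetical CSV-to-Parquet conversion job.
src = glueContext.create_dynamic_frame.from_catalog(
    database="gluecatalog",        # reusing the database name from the question
    table_name="csv_table")        # illustrative table name

glueContext.write_dynamic_frame.from_options(
    frame=src,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/output/"},  # illustrative path
    format="parquet")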
AWS Glue is recommended when your use cases are primarily ETL and when you want to run jobs on a serverless Apache Spark-based platform. Amazon Kinesis Data Analytics is recommended when your use cases are primarily analytics and when you want to run jobs on a serverless Apache Flink-based platform.
Maximum capacity
Choose an integer from 2 to 100. The default is 10. This job type cannot have a fractional DPU allocation. For AWS Glue version 2.0 or later jobs, you cannot specify a Maximum capacity; instead, you specify a Worker type and the Number of workers.
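As an illustration (the job name, role ARN, and script path below are placeholders), capacity can be set when defining the job through boto3; Glue 2.0+ jobs take a worker type and count rather than MaxCapacity:

import boto3

glue = boto3.client("glue")

# Hypothetical job definition; name, role, and script location are placeholders.
glue.create_job(
    Name="my-etl-job",
    Role="arn:aws:iam::123456789012:role/MyGlueRole",
    Command={"Name": "glueetl",
             "ScriptLocation": "s3://my-bucket/scripts/job.py",
             "PythonVersion": "3"},
    GlueVersion="3.0",
    WorkerType="G.1X",        # Glue 2.0+ jobs: specify a worker type ...
    NumberOfWorkers=10)       # ... and a worker count instead of MaxCapacity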
It's spending that time setting up the environment that allows your code to run. I had the same issue, contacted the AWS Glue team, and they were helpful. The reason the first run takes so long is that Glue builds an environment when you run the first job, and that environment stays alive for about 1 hour. If you run the same script, or any other script, within that hour, the next job takes significantly less time. They call the first run a cold start: my first job took 17 minutes, and when I ran the same job again right after the first one finished, it took only 3 minutes. (See the sketch after the update below for one way to compare the two run times.)
Update as of May 2019:
Cold start time = 7-8 minutes
Warm pool maintained for = 10-15 minutes
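To see the cold-start overhead yourself (the sketch referenced above), one option is to compare the wall-clock duration of a run with the ExecutionTime that Glue reports for it; the job name and polling loop here are illustrative:

import time
import boto3

glue = boto3.client("glue")

# Hypothetical: start a run of an existing job and wait for it to finish.
run_id = glue.start_job_run(JobName="my-etl-job")["JobRunId"]
while True:
    run = glue.get_job_run(JobName="my-etl-job", RunId=run_id)["JobRun"]
    if run["JobRunState"] in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
        break
    time.sleep(30)

# Wall-clock time (StartedOn to CompletedOn) includes environment provisioning,
# i.e. the cold start; ExecutionTime only counts the seconds the run consumed
# resources once they were provisioned.
wall_clock = (run["CompletedOn"] - run["StartedOn"]).total_seconds()
print("Wall clock:", wall_clock, "s, ExecutionTime:", run["ExecutionTime"], "s")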