Every time I run a Glue crawler on existing data, it changes the SerDe serialization lib to <code>LazySimpleSerDe</code>, which doesn't parse the data correctly (e.g. quoted fields that contain commas). <img src="https://i.stack.imgur.com/DZi0p.png" alt="enter image description here"> I then need to manually edit the table details in the Glue Data Catalog to change it to <code>org.apache.hadoop.hive.serde2.OpenCSVSerde</code>. I've tried making my own CSV classifier, but that doesn't help. How do I get the crawler to use a particular serialization lib for the tables it creates or updates?

You can't specify the SerDe in the Glue crawler at this time, but here is a workaround:
<ol>
<li>Create a Glue crawler with the following configuration:
<ul>
<li>Enable 'Add new columns only'. This adds new columns as they are discovered, but doesn't remove or change the type of existing columns in the Data Catalog.</li>
<li>Enable 'Update all new and existing partitions with metadata from the table'. With this option, partitions inherit metadata properties such as their classification, input format, output format, SerDe information, and schema from their parent table. Any changes to these properties in a table are propagated to its partitions.</li>
</ul>
</li>
<li>Run the crawler to create the table. The table will be created with <code>org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe</code>; edit this to <code>org.apache.hadoop.hive.serde2.OpenCSVSerde</code>.</li>
<li>Re-run the crawler.</li>
<li>If a new partition is added on a crawler re-run, it will also be created with <code>org.apache.hadoop.hive.serde2.OpenCSVSerde</code>.</li>
<li>You should now have a table that is set to <code>org.apache.hadoop.hive.serde2.OpenCSVSerde</code> and does not reset.</li>
</ol>
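If you would rather script the workaround than click through the console, something like the following sketch with boto3 may work. It is not part of the original answer and rests on several assumptions: that the two console options correspond to the `MergeNewColumns` and `InheritFromTable` values of the crawler's `Configuration` JSON, that the crawler has already run once so the table exists, and that the default OpenCSVSerde separator, quote, and escape characters suit your files. The function names and the `TABLE_INPUT_KEYS` subset are illustrative, not an official list.

```python
import copy
import json

OPEN_CSV_SERDE = "org.apache.hadoop.hive.serde2.OpenCSVSerde"

# Assumed subset of get_table's response that update_table's TableInput accepts;
# read-only fields such as CreateTime or DatabaseName are left out.
TABLE_INPUT_KEYS = (
    "Name", "Description", "Owner", "Retention", "StorageDescriptor",
    "PartitionKeys", "TableType", "Parameters",
)


def crawler_configuration() -> str:
    """Configuration JSON assumed to match the two console options in step 1."""
    return json.dumps({
        "Version": 1.0,
        "CrawlerOutput": {
            # 'Add new columns only'
            "Tables": {"AddOrUpdateBehavior": "MergeNewColumns"},
            # 'Update all new and existing partitions with metadata from the table'
            "Partitions": {"AddOrUpdateBehavior": "InheritFromTable"},
        },
    })


def with_open_csv_serde(table: dict) -> dict:
    """Build a TableInput from a get_table 'Table' dict with the SerDe swapped."""
    table_input = {k: copy.deepcopy(table[k]) for k in TABLE_INPUT_KEYS if k in table}
    serde = table_input["StorageDescriptor"].setdefault("SerdeInfo", {})
    serde["SerializationLibrary"] = OPEN_CSV_SERDE
    # Replaces any LazySimpleSerDe parameters; adjust these to your data.
    serde["Parameters"] = {"separatorChar": ",", "quoteChar": '"', "escapeChar": "\\"}
    return table_input


def apply_workaround(database: str, table_name: str, crawler_name: str) -> None:
    """Steps 1 to 3, for a crawler that has already created the table."""
    import boto3  # imported lazily so the helpers above work without AWS access

    glue = boto3.client("glue")
    glue.update_crawler(Name=crawler_name, Configuration=crawler_configuration())
    table = glue.get_table(DatabaseName=database, Name=table_name)["Table"]
    glue.update_table(DatabaseName=database, TableInput=with_open_csv_serde(table))
    glue.start_crawler(Name=crawler_name)  # re-run; the SerDe should now persist
```

Calling `apply_workaround("my_db", "my_table", "my_crawler")` (hypothetical names) would need AWS credentials with permission to read and update the crawler and the table. Keeping the dictionary manipulation in pure functions lets you inspect the resulting `TableInput` before sending anything to Glue.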

Specify a SerDe serialization lib with AWS Glue Crawler


171

answered Sep 18 '22 02:09

swuk


Tags:

amazon-web-services

amazon-athena

aws-glue

aws-glue-data-catalog

Luigi Plinge
