Every time I run a Glue crawler on existing data, it changes the SerDe serialization library to LazySimpleSerDe, which doesn't classify the data correctly (e.g. quoted fields containing commas).
I then need to manually edit the table details in the Glue Data Catalog to change it to org.apache.hadoop.hive.serde2.OpenCSVSerde.
I've tried creating my own CSV classifier, but that doesn't help.
How do I get the crawler to specify a particular serialization library for the tables it creates or updates?
You can't specify the SerDe in the Glue crawler at this time, but here is a workaround.
Create a Glue Crawler with the following configuration.
Enable 'Add new columns only' - this adds new columns as they are discovered, but doesn't remove or change the type of existing columns in the Data Catalog.
Enable 'Update all new and existing partitions with metadata from the table' - with this option, partitions inherit metadata properties such as classification, input format, output format, SerDe information, and schema from their parent table. Any changes to these properties in the table are propagated to its partitions.
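If you prefer to configure the crawler programmatically rather than in the console, those two options correspond to the crawler's Configuration JSON (this is the documented mapping: 'Add new columns only' is MergeNewColumns for tables, and 'Update all new and existing partitions with metadata from the table' is InheritFromTable for partitions):

```json
{
  "Version": 1.0,
  "CrawlerOutput": {
    "Tables": { "AddOrUpdateBehavior": "MergeNewColumns" },
    "Partitions": { "AddOrUpdateBehavior": "InheritFromTable" }
  }
}
```

You can pass this JSON as the Configuration parameter of the CreateCrawler or UpdateCrawler API calls.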
Run the crawler to create the table. It will create the table with "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"; edit this to "org.apache.hadoop.hive.serde2.OpenCSVSerde".
Re-run the crawler.
If a new partition is added on crawler re-run, it will also be created with "org.apache.hadoop.hive.serde2.OpenCSVSerde".
You should now have a table that is set to org.apache.hadoop.hive.serde2.OpenCSVSerde and does not get reset by subsequent crawler runs.
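If you'd rather script the one-time SerDe edit instead of doing it in the console, here is a minimal boto3 sketch. The database and table names are placeholders; note that get_table returns read-only fields that update_table's TableInput rejects, so they are stripped before writing back:

```python
# Sketch: flip an existing Glue Catalog table to OpenCSVSerDe with boto3.

OPEN_CSV_SERDE = "org.apache.hadoop.hive.serde2.OpenCSVSerde"

# Fields returned by get_table that update_table's TableInput does not accept.
READ_ONLY_KEYS = ("DatabaseName", "CreateTime", "UpdateTime", "CreatedBy",
                  "IsRegisteredWithLakeFormation", "CatalogId", "VersionId")

def to_open_csv_table_input(table):
    """Convert a get_table 'Table' dict into a TableInput using OpenCSVSerDe."""
    table_input = {k: v for k, v in table.items() if k not in READ_ONLY_KEYS}
    table_input["StorageDescriptor"]["SerdeInfo"] = {
        "SerializationLibrary": OPEN_CSV_SERDE,
        "Parameters": {"separatorChar": ",", "quoteChar": '"', "escapeChar": "\\"},
    }
    return table_input

def switch_table_to_open_csv(database, name):
    """Fetch the table, swap its SerDe, and write it back (needs AWS credentials)."""
    import boto3  # imported here so the helper above works without boto3 installed
    glue = boto3.client("glue")
    table = glue.get_table(DatabaseName=database, Name=name)["Table"]
    glue.update_table(DatabaseName=database,
                      TableInput=to_open_csv_table_input(table))

# switch_table_to_open_csv("my_database", "my_table")  # placeholder names
```

After running this once, re-running the crawler with the configuration above will leave the SerDe alone and new partitions will inherit it from the table.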