
Specify a SerDe serialization lib with AWS Glue Crawler

Every time I run a Glue crawler on existing data, it changes the SerDe serialization lib to LazySimpleSerDe, which doesn't parse the data correctly (e.g. quoted fields that contain commas).


I then need to manually edit the table details in the Glue Catalog to change it to org.apache.hadoop.hive.serde2.OpenCSVSerde.

I've tried creating my own CSV classifier, but that doesn't help.

How do I get the crawler to specify a particular serialization lib for the tables produced or updated?

Luigi Plinge asked Aug 14 '19 16:08





1 Answer

You can't specify the SerDe in the Glue crawler at this time, but here is a workaround:

  1. Create a Glue Crawler with the following configuration.

    Enable 'Add new columns only'. This adds new columns as they are discovered, but doesn't remove or change the type of existing columns in the Data Catalog.

    Enable 'Update all new and existing partitions with metadata from the table'. With this option, partitions inherit metadata properties such as classification, input format, output format, SerDe information, and schema from their parent table. Any changes to these properties in a table are propagated to its partitions.

  2. Run the crawler to create the table. It will create the table with "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"; edit this to "org.apache.hadoop.hive.serde2.OpenCSVSerde".

  3. Re-run the crawler.

  4. If a new partition is added when the crawler re-runs, it will also be created with "org.apache.hadoop.hive.serde2.OpenCSVSerde".

  5. You should now have a table that is set to org.apache.hadoop.hive.serde2.OpenCSVSerde and does not reset.
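If you would rather script these steps than click through the console, here is a rough sketch using boto3. It is not an official recipe: the helper names (`crawler_configuration`, `with_open_csv_serde`, `fix_table_serde`) are made up for illustration, and the SerDe parameters are an assumption based on OpenCSVSerde's usual defaults, so adjust them to your data. The JSON is what I'd expect to pass as the `Configuration` argument of `create_crawler` or `update_crawler` for the two options in step 1, and the table edit in step 2 goes through `get_table` and `update_table`.

```python
import copy
import json

OPEN_CSV_SERDE = "org.apache.hadoop.hive.serde2.OpenCSVSerde"

# Keys that update_table accepts in TableInput. get_table also returns
# read-only fields (CreateTime, DatabaseName, ...) that have to be dropped.
TABLE_INPUT_KEYS = {
    "Name", "Description", "Owner", "LastAccessTime", "LastAnalyzedTime",
    "Retention", "StorageDescriptor", "PartitionKeys", "ViewOriginalText",
    "ViewExpandedText", "TableType", "Parameters", "TargetTable",
}


def crawler_configuration() -> str:
    """Build the crawler Configuration JSON for the two options in step 1."""
    return json.dumps({
        "Version": 1.0,
        "CrawlerOutput": {
            # 'Add new columns only'
            "Tables": {"AddOrUpdateBehavior": "MergeNewColumns"},
            # 'Update all new and existing partitions with metadata from the table'
            "Partitions": {"AddOrUpdateBehavior": "InheritFromTable"},
        },
    })


def with_open_csv_serde(table: dict) -> dict:
    """Return a TableInput copy of a get_table result with the SerDe swapped."""
    table_input = {
        key: copy.deepcopy(value)
        for key, value in table.items()
        if key in TABLE_INPUT_KEYS
    }
    serde = table_input["StorageDescriptor"].setdefault("SerdeInfo", {})
    serde["SerializationLibrary"] = OPEN_CSV_SERDE
    # Assumption: plain comma-separated, double-quoted CSV.
    serde["Parameters"] = {
        "separatorChar": ",",
        "quoteChar": '"',
        "escapeChar": "\\",
    }
    return table_input


def fix_table_serde(database: str, table_name: str) -> None:
    """Step 2: switch an existing catalog table to OpenCSVSerde."""
    import boto3  # imported here so the helpers above work without AWS access

    glue = boto3.client("glue")
    table = glue.get_table(DatabaseName=database, Name=table_name)["Table"]
    glue.update_table(DatabaseName=database,
                      TableInput=with_open_csv_serde(table))
```

Run `fix_table_serde` once after the first crawl, then re-run the crawler as in step 3. The database and table names are whatever your crawler produced.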

swuk answered Sep 18 '22 02:09
