
Read Headers from Data Source in an AWS Glue Job

I have an AWS Glue job that reads from a data source like so:

datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "dev-data", table_name = "contacts", transformation_ctx = "datasource0")

But when I call .toDF() on the dynamic frame, the headers are 'col0', 'col1', 'col2' etc. and my actual headers are in the first row of the dataframe.

Note: I can't set them manually, as the columns in the data source are variable, and iterating over the columns in a loop to set them results in an error because you'd have to set the same dataframe variable multiple times, which Glue can't handle.

How might I capture the headers while reading from the data source?

asked May 30 '18 by Tibberzz


People also ask

How do you read data from a glue table?

You can use Athena to query AWS Glue Data Catalog metadata such as databases, tables, partitions, and columns. To obtain this metadata, you query the information_schema database on the Athena backend.
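For illustration, here is a minimal boto3 sketch of that kind of metadata query; the results bucket and the dev-data/contacts names are placeholders taken from the question, not from the quoted answer.

import time
import boto3

athena = boto3.client('athena')

# List the columns Glue has catalogued for the 'contacts' table
# (database, table and output location are placeholders).
query = """
    SELECT column_name, data_type
    FROM information_schema.columns
    WHERE table_schema = 'dev-data' AND table_name = 'contacts'
"""

run = athena.start_query_execution(
    QueryString=query,
    ResultConfiguration={'OutputLocation': 's3://my-athena-results/'})

# Poll until the query finishes, then print the result rows (the first row is the header).
query_id = run['QueryExecutionId']
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)['QueryExecution']['Status']['State']
    if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
        break
    time.sleep(1)

for row in athena.get_query_results(QueryExecutionId=query_id)['ResultSet']['Rows'][1:]:
    print([col.get('VarCharValue') for col in row['Data']])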

Can I use an Athena view as a source for an AWS Glue job?

You can, by using the Athena JDBC driver. This approach circumvents the catalog, as only Athena (and not Glue, as of 25-Jan-2019) can directly access views. Download the driver and store the jar in an S3 bucket, then specify the S3 path to the driver as a dependent jar in your job definition.
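As a rough sketch of what the read might look like inside the Glue job once the jar is attached: the driver class name, JDBC URL, staging location and credentials provider below are assumptions based on the Simba Athena JDBC driver, so check them against the driver version you download.

# Assumes the Athena JDBC jar is attached to the job as a dependent jar.
spark = glueContext.spark_session

athena_df = spark.read.format("jdbc") \
    .option("driver", "com.simba.athena.jdbc.Driver") \
    .option("url", "jdbc:awsathena://athena.us-east-1.amazonaws.com:443;"
                   "S3OutputLocation=s3://my-athena-results/;"
                   "AwsCredentialsProviderClass=com.simba.athena.amazonaws.auth.InstanceProfileCredentialsProvider") \
    .option("dbtable", "my_database.my_view") \
    .load()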


2 Answers

You can try the withHeader format option, e.g.:

# Read the CSV directly from S3, treating the first row as the header.
dyF = glueContext.create_dynamic_frame.from_options(
    connection_type='s3',
    connection_options={'paths': ['s3://awsglue-datasets/examples/medicare/Medicare_Hospital_Provider.csv']},
    format='csv',
    format_options={'withHeader': True})

The documentation for these format options can be found in the AWS Glue developer guide.
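With the header read that way, converting the DynamicFrame to a Spark DataFrame should show the real column names instead of col0, col1, etc. A quick check, using the dyF from the snippet above:

df = dyF.toDF()
df.printSchema()   # column names now come from the first row of the CSV
df.show(5)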

answered Sep 23 '22 by Dheeraj


I know this post is old, but I just ran into a similar issue and spent way too long figuring out what the problem was. Wanted to share my solution in case it's helpful to others!

I was using the GUI on AWS and forgot to add the correct classifier to the crawler before running it. This resulted in AWS Glue incorrectly detecting the datatypes (they mostly came out as strings) and failing to detect the column names (they came out as col1, col2, etc.). You can create the classifier in "classifiers" under "crawlers". Then, when setting up the crawler, add your classifier to the "selected classifiers" section at the bottom.

Documentation: https://docs.aws.amazon.com/glue/latest/dg/add-classifier.html
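For reference, the same setup can be scripted with boto3 instead of the console; the classifier name, crawler name, role ARN and S3 path below are placeholders.

import boto3

glue = boto3.client('glue')

# A CSV classifier telling the crawler that the first row is a header,
# so columns are not catalogued as col1, col2, ...
glue.create_classifier(
    CsvClassifier={
        'Name': 'contacts-csv-classifier',
        'Delimiter': ',',
        'QuoteSymbol': '"',
        'ContainsHeader': 'PRESENT'})

# Attach the classifier to the crawler so it is tried before the built-in ones.
glue.create_crawler(
    Name='contacts-crawler',
    Role='arn:aws:iam::123456789012:role/my-glue-role',
    DatabaseName='dev-data',
    Targets={'S3Targets': [{'Path': 's3://my-bucket/contacts/'}]},
    Classifiers=['contacts-csv-classifier'])

glue.start_crawler(Name='contacts-crawler')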

answered Sep 22 '22 by TheGreenSpleen25