I have an AWS Glue job that reads from a data source like so:
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "dev-data", table_name = "contacts", transformation_ctx = "datasource0")
But when I call .toDF() on the dynamic frame, the headers are 'col0', 'col1', 'col2' etc. and my actual headers are in the first row of the dataframe.
Note - I can't set them manually as the columns in the data source are variable & iterating over the columns in a loop to set them results in error because you'd have to set the same dataframe variable multiple times, which glue can't handle.
How might I capture the headers while reading from the data source?
You can use Athena to query AWS Glue catalog metadata like databases, tables, partitions, and columns. To obtain AWS Glue Catalog metadata, you query the information_schema database on the Athena backend. The example queries in this topic show how to use Athena to query AWS Glue Catalog metadata for common use cases.
You can by using the Athena JDBC driver. This approach circumvents the catalog, as only Athena (and not Glue as of 25-Jan-2019) can directly access views. Download the driver and store the jar to an S3 bucket. Specify the S3 path to the driver as a dependent jar in your job definition.
You can try withHeader param. e.g.
dyF = glueContext.create_dynamic_frame.from_options(
's3',
{'paths': ['s3://awsglue-datasets/examples/medicare/Medicare_Hospital_Provider.csv']},
'csv',
{'withHeader': True})
The documentation for this can be found here
I know this post is old, but I just ran into a similar issue and spent way too long figuring out what the problem was. Wanted to share my solution in case it's helpful to others!
I was using the GUI on AWS and forgot to actually add the correct classifier to the crawler before running it. This resulted in AWS Glue incorrectly detecting datatypes (they mostly came out as strings) and the column names were not detected (they came out as col1, col2, etc). You can create the classifier in "classifiers" under "crawlers". Then, when setting up the crawler, add your classifier to the "selected classifiers" section at the bottom.
Documentation: https://docs.aws.amazon.com/glue/latest/dg/add-classifier.html
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With