Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

AWS Glue Crawler Cannot Extract CSV Headers

At my wits end here...

I have 15 csv files that I am generating from a beeline query like:

beeline -u CONN_STR --outputformat=dsv -e "SELECT ... " > data.csv

I chose dsv because some string fields include commas and they are not quoted, which breaks glue even more. Besides, according to the docs, the built in csv classifier can handle pipes (and for the most part, it does).

Anyway, I upload these 15 csv files to an s3 bucket and run my crawler.

Everything works great. For 14 of them.

Glue is able to extract the header line for every single file except one, naming the columns col_0, col_1, etc, and including the header line in my select queries.

Can anyone provide any insight into what could possibly be different about this one file that is causing this?

If it helps, I have a feeling that some of the fields in this csv file may, at some point, been encoded in UTF-16 or something. When I originally open it, there were some weird "?" characters floating around.

I've run tr -d '\000' on it in an effort to clean it up, but that could have not been enough.

Again, any leads, suggestions, or experiments I can run would be great. Btw, I would prefer if the crawler was able to do everything (ie: not needing to manually change the schema and turn off updates).

Thanks for reading.

Edit:

Have a feeling this has something to do with it source:

Every column in a potential header parses as a STRING data type.

Except for the last column, every column in a potential header has content that is fewer than 150 characters. To allow for a trailing delimiter, the last column can be empty throughout the file.

Every column in a potential header must meet the AWS Glue regex requirements for a column name.

The header row must be sufficiently different from the data rows. To determine this, one or more of the rows must parse as other than STRING type. If all columns are of type STRING, then the first row of data is not sufficiently different from subsequent rows to be used as the header.

like image 708
Mac Avatar asked Jan 25 '19 21:01

Mac


People also ask

What is the output of the crawler in AWS glue?

The output of the crawler is one or more metadata tables defined in the AWS Glue Data Catalog. A table is created for one or more files found in your data store. If all the Amazon S3 files in a folder have the same schema, the crawler creates one table.

How do I populate the AWS glue data catalog with tables?

You can use a crawler to populate the AWS Glue Data Catalog with tables. This is the primary method used by most AWS Glue users. A crawler can crawl multiple data stores in a single run. Upon completion, the crawler creates or updates one or more tables in your Data Catalog.

How does the built-in CSV classifier work for AWS glue tables?

The built-in CSV classifier parses CSV file contents to determine the schema for an AWS Glue table. This classifier checks for the following delimiters: Ctrl-A is the Unicode control character for Start Of Heading. To be classified as CSV, the table schema must have at least two columns and two rows of data.

How to check for the S3 bucket and glue crawler in AWS?

We can use the AWS CLI to check for the S3 bucket and Glue crawler: ... # List Glue Crawlers So our CSVCrawlerStack class created the S3 bucket csv-crawler-csvbuckete3c1c3b8–1kgq8cmtaf65h, and the Glue crawler csvcrawler-K97ENFxAYqre.


2 Answers

Adding a Custom Classifier fixed a similar issue of mine.

You can avoid header detection (which doesn't work when all columns are string type) by setting ContainsHeader to PRESENT when creating the custom classifier, and then provide the column names through Header. Once the custom classifier has been created you can assign this to the crawler. Since this is added to the crawler, you won't need to make changes to the schema after the fact, and don't risk these changes being overwritten in the next crawler run. Using boto3, it would look something like:

import boto3


glue = boto3.client('glue')

glue.create_classifier(CsvClassifier={
    'Name': 'contacts_csv',
    'Delimiter': ',',
    'QuoteSymbol': '"',
    'ContainsHeader': 'PRESENT',
    'Header': ['contact_id', 'person_id', 'type', 'value']
})

glue.create_crawler(Name=GLUE_CRAWLER,
                    Role=role.arn,
                    DatabaseName=GLUE_DATABASE,
                    Targets={'S3Targets': [{'Path': s3_path}]},
                    Classifiers=['contacts_csv'])
like image 53
Thom Lane Avatar answered Sep 28 '22 13:09

Thom Lane


I was having the same issue where Glue does not recognize the header row when all columns are Strings

I found that adding a new column on the end with an integer solves the problem

id,name,extra_column sdf13,dog,1

like image 28
comfytoday Avatar answered Sep 28 '22 13:09

comfytoday