AWS Glue Crawler Cannot Extract CSV Headers

Tags: csv, amazon-athena, aws-glue

At my wits' end here...

I have 15 CSV files that I generate with a beeline query like:


beeline -u CONN_STR --outputformat=dsv -e "SELECT ... " > data.csv

I chose dsv because some string fields include commas and are not quoted, which breaks Glue even more. Besides, according to the docs, the built-in CSV classifier can handle pipes (and for the most part, it does).

Anyway, I upload these 15 CSV files to an S3 bucket and run my crawler.

Everything works great. For 14 of them.

Glue is able to extract the header line for every single file except one. For that file, it names the columns col_0, col_1, etc., and the header line shows up as a data row in my SELECT queries.

Can anyone provide any insight into what could possibly be different about this one file that is causing this?

If it helps, I have a feeling that some of the fields in this CSV file may, at some point, have been encoded in UTF-16 or something. When I originally opened it, there were some weird "?" characters floating around.

I've run tr -d '\000' on it in an effort to clean it up, but that may not have been enough.

Again, any leads, suggestions, or experiments I can run would be great. Btw, I would prefer it if the crawler could do everything (i.e., without my having to manually change the schema and turn off updates).
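One experiment for the encoding suspicion: tr -d '\000' strips NUL bytes but leaves a UTF-16 byte-order mark and any other non-ASCII bytes in place. The sketch below only counts such bytes; scan_bytes is a hypothetical helper name, and whether leftover bytes actually affect Glue's header detection is an assumption to test, not something the docs quoted here confirm.

```python
import codecs

# Byte-order marks that can survive a partial cleanup.
BOMS = {
    "utf-8": codecs.BOM_UTF8,
    "utf-16-le": codecs.BOM_UTF16_LE,
    "utf-16-be": codecs.BOM_UTF16_BE,
}


def scan_bytes(data: bytes) -> dict:
    """Report a leading BOM plus counts of NUL and non-ASCII bytes."""
    bom = next((name for name, mark in BOMS.items() if data.startswith(mark)), None)
    return {
        "bom": bom,
        "nul_bytes": data.count(b"\x00"),
        "non_ascii_bytes": sum(1 for byte in data if byte > 0x7F),
    }


# Usage (data.csv is the file from the beeline command above):
# with open("data.csv", "rb") as fh:
#     print(scan_bytes(fh.read()))
```

Comparing the result for the failing file against one of the 14 working files should show quickly whether stray bytes are a difference worth chasing.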

Thanks for reading.

Edit:

I have a feeling this has something to do with it. From the docs:

Every column in a potential header parses as a STRING data type.

Except for the last column, every column in a potential header has content that is fewer than 150 characters. To allow for a trailing delimiter, the last column can be empty throughout the file.

Every column in a potential header must meet the AWS Glue regex requirements for a column name.

The header row must be sufficiently different from the data rows. To determine this, one or more of the rows must parse as other than STRING type. If all columns are of type STRING, then the first row of data is not sufficiently different from subsequent rows to be used as the header.
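A rough local check of the conditions quoted above might look like the sketch below. It is only an approximation: looks_like_string is a crude stand-in for Glue's real type inference (it treats anything float() accepts as non-STRING), the column-name regex is not checked because it is not given in the quote, and the non-ASCII check is an extra diagnostic of my own rather than one of the quoted rules. The pipe delimiter is assumed from the dsv output.

```python
import csv
import io


def looks_like_string(value: str) -> bool:
    """Crude stand-in for type inference: anything that is not a number."""
    try:
        float(value)
        return False
    except ValueError:
        return True


def check_header(text: str, delimiter: str = "|") -> list:
    """Return the quoted conditions that the first row appears to violate."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if len(rows) < 2:
        return ["need a header row and at least one data row"]
    header, data = rows[0], rows[1:]
    problems = []
    if not all(looks_like_string(cell) for cell in header):
        problems.append("a header cell does not parse as STRING")
    if any(len(cell) >= 150 for cell in header[:-1]):
        problems.append("a header cell (not the last) has 150+ characters")
    # Extra diagnostic, not one of the quoted rules.
    if any(not cell.isascii() or not cell.isprintable() for cell in header):
        problems.append("a header cell has non-ASCII or control characters")
    # "Sufficiently different": some data cell must be non-STRING.
    if all(looks_like_string(cell) for row in data for cell in row):
        problems.append("every data cell parses as STRING")
    return problems
```

Running it on the text of the failing file and on one of the working files should narrow down which condition, if any, separates them.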


asked Jan 25 '19 21:01

Mac

2 Answers

Adding a custom classifier fixed a similar issue for me.

You can avoid header detection (which doesn't work when all columns are of string type) by setting ContainsHeader to PRESENT when creating the custom classifier and providing the column names through Header. Once the custom classifier has been created, you can assign it to the crawler. Because the classifier is attached to the crawler, you won't need to edit the schema after the fact, and you don't risk those edits being overwritten on the next crawler run. Using boto3, it would look something like:


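import boto3

# Hypothetical placeholder values; replace them with your own.
GLUE_CRAWLER = 'contacts_crawler'
GLUE_DATABASE = 'contacts_db'
ROLE_ARN = 'arn:aws:iam::123456789012:role/GlueCrawlerRole'
S3_PATH = 's3://my-bucket/contacts/'

glue = boto3.client('glue')

# ContainsHeader='PRESENT' skips header detection; Header supplies the names.
glue.create_classifier(CsvClassifier={
    'Name': 'contacts_csv',
    'Delimiter': ',',
    'QuoteSymbol': '"',
    'ContainsHeader': 'PRESENT',
    'Header': ['contact_id', 'person_id', 'type', 'value']
})

# Attach the custom classifier to the crawler by name.
glue.create_crawler(Name=GLUE_CRAWLER,
                    Role=ROLE_ARN,
                    DatabaseName=GLUE_DATABASE,
                    Targets={'S3Targets': [{'Path': S3_PATH}]},
                    Classifiers=['contacts_csv'])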
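If typing the Header list by hand is error-prone, it could be read from the file's own first row instead. This is a minimal sketch, assuming a pipe-delimited file as in the question; read_header is a hypothetical helper, not part of boto3.

```python
import csv
from typing import Iterable, List


def read_header(lines: Iterable[str], delimiter: str = "|") -> List[str]:
    """Return the first row of delimited text as a list of column names."""
    first_row = next(csv.reader(lines, delimiter=delimiter))
    # Drop a stray BOM and surrounding whitespace from each name.
    return [name.lstrip("\ufeff").strip() for name in first_row]


# Usage (an open text file is an iterable of lines):
# with open("data.csv", newline="", encoding="utf-8") as fh:
#     header = read_header(fh)
# Then pass header as 'Header' and '|' as 'Delimiter' in CsvClassifier.
```

The names still have to be valid Glue column names, so it is worth printing the list and checking it before creating the classifier.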

answered Sep 28 '22 13:09

Thom Lane

I was having the same issue, where Glue does not recognize the header row when all columns are strings.

I found that adding a new column on the end with an integer value solves the problem:

id,name,extra_column
sdf13,dog,1
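For an existing file, the extra column could be appended with a small script. This is a sketch under assumptions: the helper names, the column name extra_column, and the use of a running row number as the integer value are all illustrative choices, not something Glue requires.

```python
import csv
from typing import Iterable, Iterator, List


def with_int_column(rows: Iterable[List[str]], name: str = "extra_column") -> Iterator[List[str]]:
    """Yield each row with one extra trailing cell: the name, then 1, 2, ..."""
    for index, row in enumerate(rows):
        yield row + [name if index == 0 else str(index)]


def add_int_column(src: str, dst: str, delimiter: str = ",") -> None:
    """Copy src to dst, appending the integer column to every row."""
    with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
        reader = csv.reader(fin, delimiter=delimiter)
        csv.writer(fout, delimiter=delimiter).writerows(with_int_column(reader))


# Usage (paths are placeholders):
# add_int_column("data.csv", "data_with_int.csv", delimiter="|")
```

The downside is an extra column in every table, which you would then have to ignore in queries.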

answered Sep 28 '22 13:09

comfytoday
