When I run the AWS Glue crawler, it does not recognize timestamp columns.
My CSV file contains correctly formatted ISO 8601 timestamps. I first expected Glue to classify these as timestamps automatically, which it does not.
I also tried a custom timestamp classifier from this link https://docs.aws.amazon.com/glue/latest/dg/custom-classifier.html
Here is what my classifier looks like
This also does not correctly classify my timestamps.
I have put my data into the grok debugger (https://grokdebug.herokuapp.com/), for example
id,iso_8601_now,iso_8601_yesterday
0,2019-05-16T22:47:33.409056,2019-05-15T22:47:33.409056
1,2019-05-16T22:47:33.409056,2019-05-15T22:47:33.409056
and it matches both of the following patterns:
%{TIMESTAMP_ISO8601:timestamp}
%{YEAR}-%{MONTHNUM}-%{MONTHDAY}[T ]%{HOUR}:?%{MINUTE}(?::?%{SECOND})?%{ISO8601_TIMEZONE}?
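For completeness, the classifier itself can be created through boto3; a minimal sketch (the classifier name and classification label below are placeholders, not my actual values):

# Sketch only: register a grok classifier using the ISO 8601 pattern above.
# "iso8601-timestamps" and "csv_iso8601" are placeholder values.
import boto3

glue = boto3.client("glue")
glue.create_classifier(
    GrokClassifier={
        "Name": "iso8601-timestamps",
        "Classification": "csv_iso8601",
        "GrokPattern": "%{TIMESTAMP_ISO8601:timestamp}",
    }
)

The test CSV itself is generated with the following script: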
import csv
from datetime import datetime, timedelta

with open("timestamp_test.csv", 'w', newline='') as f:
    w = csv.writer(f, delimiter=',')
    w.writerow(["id", "iso_8601_now", "iso_8601_yesterday"])
    for i in range(1000):
        w.writerow([i, datetime.utcnow().isoformat(), (datetime.utcnow() - timedelta(days=1)).isoformat()])
I expect AWS Glue to automatically classify the iso_8601 columns as timestamps. Even with the custom grok classifier added, it still does not classify either of the columns as a timestamp.
Both columns are classified as strings.
The classifier is attached to the crawler.
Output of the timestamp_test table created by the crawler:
{
  "StorageDescriptor": {
    "cols": {
      "FieldSchema": [
        {
          "name": "id",
          "type": "bigint",
          "comment": ""
        },
        {
          "name": "iso_8601_now",
          "type": "string",
          "comment": ""
        },
        {
          "name": "iso_8601_yesterday",
          "type": "string",
          "comment": ""
        }
      ]
    },
    "location": "s3://REDACTED/_csv_timestamp_test/",
    "inputFormat": "org.apache.hadoop.mapred.TextInputFormat",
    "outputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
    "compressed": "false",
    "numBuckets": "-1",
    "SerDeInfo": {
      "name": "",
      "serializationLib": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
      "parameters": {
        "field.delim": ","
      }
    },
    "bucketCols": [],
    "sortCols": [],
    "parameters": {
      "skip.header.line.count": "1",
      "sizeKey": "58926",
      "objectCount": "1",
      "UPDATED_BY_CRAWLER": "REDACTED",
      "CrawlerSchemaSerializerVersion": "1.0",
      "recordCount": "1227",
      "averageRecordSize": "48",
      "CrawlerSchemaDeserializerVersion": "1.0",
      "compressionType": "none",
      "classification": "csv",
      "columnsOrdered": "true",
      "areColumnsQuoted": "false",
      "delimiter": ",",
      "typeOfData": "file"
    },
    "SkewedInfo": {},
    "storedAsSubDirectories": "false"
  },
  "parameters": {
    "skip.header.line.count": "1",
    "sizeKey": "58926",
    "objectCount": "1",
    "UPDATED_BY_CRAWLER": "REDACTED",
    "CrawlerSchemaSerializerVersion": "1.0",
    "recordCount": "1227",
    "averageRecordSize": "48",
    "CrawlerSchemaDeserializerVersion": "1.0",
    "compressionType": "none",
    "classification": "csv",
    "columnsOrdered": "true",
    "areColumnsQuoted": "false",
    "delimiter": ",",
    "typeOfData": "file"
  }
}
Configuration: In your function options, specify format="csv". In your connection_options, use the paths key to specify s3path. You can configure how the reader interacts with S3 in connection_options. For details, see Connection types and options for ETL in AWS Glue: Amazon S3 connection.
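As a sketch of that configuration (the header and separator options below are assumptions about this particular file), reading the crawled CSV in a Glue ETL job might look like:

# Sketch: read the CSV from S3 with the options described above.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://REDACTED/_csv_timestamp_test/"]},
    format="csv",
    format_options={"withHeader": True, "separator": ","},
)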
Compressed CSV, JSON, ORC, and Parquet files are supported, but CSV and JSON files must include the compression codec as the file extension. If you are importing a folder, all files in the folder must be of the same file type.
For CSV files, the crawler reads either the first 1000 records or the first 1 MB of data, whichever comes first. For Parquet files, the crawler infers the schema directly from the file. The crawler compares the schemas inferred from all the subfolders and files, and then creates one or more tables.
Built-in classifiers can't parse fixed-width data files. Use a grok custom classifier instead.
AWS Glue can recognize and interpret this data format from an Apache Kafka, Amazon Managed Streaming for Apache Kafka, or Amazon Kinesis message stream. We expect streams to present data in a consistent format, so they are read in as DataFrames.
According to the Athena CREATE TABLE documentation, the timestamp format is yyyy-mm-dd hh:mm:ss[.f...].
If you must use the ISO 8601 format, add this SerDe parameter: 'timestamp.formats'='yyyy-MM-dd\'T\'HH:mm:ss.SSSSSS'
You can alter the table from Glue (1) or recreate it from Athena (2):
CREATE EXTERNAL TABLE `table1`(
  `id` bigint,
  `iso_8601_now` timestamp,
  `iso_8601_yesterday` timestamp)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'field.delim' = ',',
  'timestamp.formats'='yyyy-MM-dd\'T\'HH:mm:ss.SSSSSS')
LOCATION
  's3://REDACTED/_csv_timestamp_test/'
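For option (1), a rough sketch of the same change through the Glue API (the database and table names here are placeholders):

# Sketch: patch the crawled table in place instead of recreating it in Athena.
# "my_database" and "timestamp_test" are placeholder names.
import boto3

glue = boto3.client("glue")
table = glue.get_table(DatabaseName="my_database", Name="timestamp_test")["Table"]

sd = table["StorageDescriptor"]
for col in sd["Columns"]:
    if col["Name"] in ("iso_8601_now", "iso_8601_yesterday"):
        col["Type"] = "timestamp"
sd["SerdeInfo"].setdefault("Parameters", {})["timestamp.formats"] = "yyyy-MM-dd'T'HH:mm:ss.SSSSSS"

# TableInput only accepts a subset of the keys returned by get_table.
allowed = {"Name", "Description", "Owner", "Retention", "StorageDescriptor",
           "PartitionKeys", "TableType", "Parameters"}
glue.update_table(DatabaseName="my_database",
                  TableInput={k: v for k, v in table.items() if k in allowed})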