AWS Glue Crawler Classifies json file as UNKNOWN

I'm working on an ETL job that will ingest JSON files into an RDS staging table. The crawler I've configured classifies JSON files without issue as long as they are under 1 MB in size. If I minify a file (instead of pretty-printing it), the crawler also classifies it without issue as long as the result is under 1 MB.

I'm having trouble coming up with a workaround. I tried converting the JSON to BSON and gzipping the JSON file, but it is still classified as UNKNOWN.

Has anyone else run into this issue? Is there a better way to do this?

asked Oct 25 '17 by gscho

People also ask

How do I add a classifier to a crawler in AWS Glue?

Change the column names in the Data Catalog, set the SchemaChangePolicy to LOG, and set the partition output configuration to InheritFromTable for future crawler runs. Create a custom grok classifier to parse the data and assign the columns that you want.
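For reference, a minimal boto3 sketch of that crawler setup might look like the following; the crawler and classifier names are made-up placeholders:

```python
import boto3

glue = boto3.client("glue")

# Hypothetical example: point an existing crawler at a custom classifier,
# log schema changes instead of applying them, and have new partitions
# inherit their schema from the table.
glue.update_crawler(
    Name="my-crawler",                      # placeholder crawler name
    Classifiers=["my-custom-classifier"],   # placeholder classifier name
    SchemaChangePolicy={
        "UpdateBehavior": "LOG",
        "DeleteBehavior": "LOG",
    },
    Configuration=(
        '{"Version": 1.0, '
        '"CrawlerOutput": {"Partitions": {"AddOrUpdateBehavior": "InheritFromTable"}}}'
    ),
)
```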

Which data format AWS Glue built in classifier Cannot parse?

Built-in classifiers can't parse fixed-width data files. Use a grok custom classifier instead.
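As an illustration, a grok custom classifier for fixed-width data could be created like this with boto3; the field widths, field names, and classifier name below are invented for the example:

```python
import boto3

glue = boto3.client("glue")

# Hypothetical example: a grok classifier for a fixed-width file whose
# records are a 10-character id, a 20-character name, and an 8-digit date.
glue.create_classifier(
    GrokClassifier={
        "Name": "fixed-width-classifier",  # placeholder name
        "Classification": "fixedwidth",    # free-form classification label
        "GrokPattern": "%{ID10:id}%{NAME20:name}%{DATE8:date}",
        # Custom patterns are defined one "NAME regex" per line.
        "CustomPatterns": "ID10 .{10}\nNAME20 .{20}\nDATE8 \\d{8}",
    }
)
```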

How does the crawler infer the schema of a csv file?

For CSV files, the crawler reads either the first 1000 records or the first 1 MB of data, whichever comes first. For Parquet files, the crawler infers the schema directly from the file. The crawler compares the schemas inferred from all the subfolders and files, and then creates one or more tables.

What determines the schema of the data in AWS Glue?

Classifier. Determines the schema of your data. AWS Glue provides classifiers for common file types, such as CSV, JSON, AVRO, XML, and others. It also provides classifiers for common relational database management systems using a JDBC connection.


2 Answers

I have two JSON files, 42 MB and 16 MB, partitioned on S3 under these paths:

  • s3://bucket/stg/year/month/_0.json

  • s3://bucket/stg/year/month/_1.json

I had the same problem as you: the crawler classified the files as UNKNOWN.

I was able to solve it:

  • Create a custom classifier with the JSON path "$[*]", then create a new crawler that uses the classifier (see the boto3 sketch after this list).
  • Run the new crawler against the data on S3 and the proper schema will be created.
  • DO NOT update your current crawler with the classifier, as it won't apply the change. I don't know why; maybe because of the classifier versioning AWS mentions in its documentation. Creating a new crawler made it work.
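A minimal boto3 sketch of those steps; the classifier, crawler, IAM role, and database names are all placeholders:

```python
import boto3

glue = boto3.client("glue")

# Hypothetical example: a JSON classifier that treats each element of the
# top-level array as its own record.
glue.create_classifier(
    JsonClassifier={
        "Name": "json-array-classifier",  # placeholder name
        "JsonPath": "$[*]",
    }
)

# Create a NEW crawler that uses the classifier; updating an existing
# crawler did not pick up the change.
glue.create_crawler(
    Name="stg-json-crawler",             # placeholder name
    Role="AWSGlueServiceRole-default",   # placeholder IAM role
    DatabaseName="stg",                  # placeholder database
    Targets={"S3Targets": [{"Path": "s3://bucket/stg/"}]},
    Classifiers=["json-array-classifier"],
)

glue.start_crawler(Name="stg-json-crawler")
```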
answered Oct 23 '22 by Dominic Nguyen


As mentioned in

https://docs.aws.amazon.com/glue/latest/dg/custom-classifier.html#custom-classifier-json

When you run a crawler using the built-in JSON classifier, the entire file is used to define the schema. Because you don’t specify a JSON path, the crawler treats the data as one object, that is, just an array.

That is something Dominic Nguyen also pointed out in his answer.
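Concretely, with a file shaped like the sample below (the field names are made up), the built-in classifier sees only a single top-level array, whereas a custom classifier with the JSON path "$[*]" treats each element as a record:

```json
[
  {"id": 1, "name": "first record"},
  {"id": 2, "name": "second record"}
]
```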

answered Oct 23 '22 by user3056726