AWS Glue Crawler Classifies json file as UNKNOWN

I'm working on an ETL job that will ingest JSON files into an RDS staging table. The crawler I've configured classifies JSON files without issue as long as they are under 1 MB in size. If I minify a file (instead of pretty-printing it), the crawler also classifies it without issue as long as the result is under 1 MB.

I'm having trouble coming up with a workaround. I tried converting the JSON to BSON and gzipping the JSON file, but it is still classified as UNKNOWN.

Has anyone else run into this issue? Is there a better way to do this?

asked Oct 25 '17 by gscho

People also ask

How do I add a classifier to a crawler in AWS Glue?

Change the column names in the Data Catalog, set the SchemaChangePolicy to LOG, and set the partition output configuration to InheritFromTable for future crawler runs. Create a custom grok classifier to parse the data and assign the columns that you want.
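For reference, a minimal boto3 sketch of that crawler setup might look like the following; the crawler and classifier names are made-up placeholders:

```python
import boto3

glue = boto3.client("glue")

# Hypothetical example: point an existing crawler at a custom classifier,
# log schema changes instead of applying them, and have new partitions
# inherit their schema from the table.
glue.update_crawler(
    Name="my-crawler",                      # placeholder crawler name
    Classifiers=["my-custom-classifier"],   # placeholder classifier name
    SchemaChangePolicy={
        "UpdateBehavior": "LOG",
        "DeleteBehavior": "LOG",
    },
    Configuration=(
        '{"Version": 1.0, '
        '"CrawlerOutput": {"Partitions": {"AddOrUpdateBehavior": "InheritFromTable"}}}'
    ),
)
```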

Which data format AWS Glue built in classifier Cannot parse?

Built-in classifiers can't parse fixed-width data files. Use a grok custom classifier instead.
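As an illustration, a grok custom classifier for fixed-width data could be created like this with boto3; the field widths, field names, and classifier name below are invented for the example:

```python
import boto3

glue = boto3.client("glue")

# Hypothetical example: a grok classifier for a fixed-width file whose
# records are a 10-character id, a 20-character name, and an 8-digit date.
glue.create_classifier(
    GrokClassifier={
        "Name": "fixed-width-classifier",  # placeholder name
        "Classification": "fixedwidth",    # free-form classification label
        "GrokPattern": "%{ID10:id}%{NAME20:name}%{DATE8:date}",
        # Custom patterns are defined one "NAME regex" per line.
        "CustomPatterns": "ID10 .{10}\nNAME20 .{20}\nDATE8 \\d{8}",
    }
)
```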

How does the crawler infer the schema of a csv file?

For CSV files, the crawler reads either the first 1000 records or the first 1 MB of data, whichever comes first. For Parquet files, the crawler infers the schema directly from the file. The crawler compares the schemas inferred from all the subfolders and files, and then creates one or more tables.

What determines the schema of the data in AWS Glue?

Classifier. Determines the schema of your data. AWS Glue provides classifiers for common file types, such as CSV, JSON, AVRO, XML, and others. It also provides classifiers for common relational database management systems using a JDBC connection.


2 Answers

I have two JSON files, 42 MB and 16 MB, partitioned on S3 under these paths:

  • s3://bucket/stg/year/month/_0.json

  • s3://bucket/stg/year/month/_1.json

I had the same problem as you: the crawler classified the files as UNKNOWN.

I was able to solve it:

  • Create a custom classifier with the JSON path "$[*]", then create a new crawler that uses the classifier (see the boto3 sketch after this list).
  • Run the new crawler against the data on S3 and the proper schema will be created.
  • DO NOT update your current crawler with the classifier, as it won't apply the change. I don't know why; maybe because of the classifier versioning AWS mentions in its documentation. Creating a new crawler made it work.
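A minimal boto3 sketch of those steps; the classifier, crawler, IAM role, and database names are all placeholders:

```python
import boto3

glue = boto3.client("glue")

# Hypothetical example: a JSON classifier that treats each element of the
# top-level array as its own record.
glue.create_classifier(
    JsonClassifier={
        "Name": "json-array-classifier",  # placeholder name
        "JsonPath": "$[*]",
    }
)

# Create a NEW crawler that uses the classifier; updating an existing
# crawler did not pick up the change.
glue.create_crawler(
    Name="stg-json-crawler",             # placeholder name
    Role="AWSGlueServiceRole-default",   # placeholder IAM role
    DatabaseName="stg",                  # placeholder database
    Targets={"S3Targets": [{"Path": "s3://bucket/stg/"}]},
    Classifiers=["json-array-classifier"],
)

glue.start_crawler(Name="stg-json-crawler")
```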
answered Oct 23 '22 by Dominic Nguyen


As mentioned in

https://docs.aws.amazon.com/glue/latest/dg/custom-classifier.html#custom-classifier-json

When you run a crawler using the built-in JSON classifier, the entire file is used to define the schema. Because you don’t specify a JSON path, the crawler treats the data as one object, that is, just an array.

That is something Dominic Nguyen also pointed out in his answer.
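Concretely, with a file shaped like the sample below (the field names are made up), the built-in classifier sees only a single top-level array, whereas a custom classifier with the JSON path "$[*]" treats each element as a record:

```json
[
  {"id": 1, "name": "first record"},
  {"id": 2, "name": "second record"}
]
```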

answered Oct 23 '22 by user3056726