What I understand from the AWS Glue docs is that a crawler helps crawl and discover new data. However, I noticed that after crawling once, if new data is added to S3, that data is already discovered when I query the Data Catalog, from Athena for example. So, can I say I do not need to run a crawler every time new data is added, unless there are new schemas?
In fact, if I know the schema of the files, I can just manually create the table and do without a crawler, am I correct?
No, you don't need to create a crawler to run a Glue job. A crawler can read multiple data sources and keep the Glue Data Catalog up to date.
You can use a crawler to populate the AWS Glue Data Catalog with tables. This is the primary method used by most AWS Glue users. A crawler can crawl multiple data stores in a single run. Upon completion, the crawler creates or updates one or more tables in your Data Catalog.
What is a crawler? A crawler is a job defined in AWS Glue. It crawls databases and buckets in S3 and then creates tables in AWS Glue together with their schema. Then, you can perform your data operations in Glue, like ETL.
If data is partitioned by some keys (placed in sub-folders, like /data/year=2018/month=11/day=2), then you need a crawler to register newly added partitions (i.e. /day=3) in the Data Catalog to be able to query them via Athena. However, if the data is not partitioned, or arrives into already registered partitions, there is no need to run a crawler.
Alternatively to running a crawler, you can discover and register new partitions by running the Athena command MSCK REPAIR TABLE &lt;table&gt;, or by registering them manually.
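Both approaches can be run straight from the Athena query editor. A minimal sketch, assuming a hypothetical partitioned table named sales_data backed by s3://my-bucket/data/:

    -- Register one new partition by hand (table and bucket names
    -- are hypothetical; adjust to your own layout):
    ALTER TABLE sales_data ADD IF NOT EXISTS
      PARTITION (year = '2018', month = '11', day = '3')
      LOCATION 's3://my-bucket/data/year=2018/month=11/day=3/';

    -- Or let Athena scan the table's S3 location and register every
    -- missing Hive-style (key=value) partition in one pass:
    MSCK REPAIR TABLE sales_data;

Note that MSCK REPAIR TABLE scans the whole table location, so on buckets with many partitions an explicit ALTER TABLE ... ADD PARTITION for just the new folder is usually faster.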
The easiest way to create a table in the Data Catalog is to run a crawler. But if you know the schema and have the patience to compose a CREATE TABLE Athena query, or to fill in all the fields via the AWS Glue console, then you can go that way as well.
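A minimal sketch of the manual route, assuming hypothetical comma-delimited files with two columns under the same s3://my-bucket/data/ prefix (running this in Athena registers the table in the Glue Data Catalog):

    -- Schema, delimiter, and location below are assumptions
    -- for illustration; adjust them to match your files.
    CREATE EXTERNAL TABLE IF NOT EXISTS sales_data (
      order_id string,
      amount   double
    )
    PARTITIONED BY (year string, month string, day string)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 's3://my-bucket/data/'
    TBLPROPERTIES ('skip.header.line.count' = '1');

Keep in mind that a partitioned table created this way starts with no partitions registered, so you still need one of the registration steps above before the data is queryable.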