AWS Glue: Do I really need a Crawler for new content?

What I understand from the AWS Glue docs is that a crawler helps crawl and discover new data. However, I noticed that after I have crawled once, if new data goes into S3, the data is already discovered when I query the data catalog from Athena, for example. So, can I say I do not need a crawler to crawl every time new data is added, unless there are new schemas?

In fact, if I know the schema of the files, I can just manually create the table and do without a crawler, am I correct?

Jiew Meng asked Nov 03 '18 02:11


1 Answer

If the data is partitioned by some keys (placed in sub-folders, like /data/year=2018/month=11/day=2), then you need a crawler to register newly added partitions (e.g. /day=3) in the Data Catalog before you can query them via Athena.

However, if the data is not partitioned, or new files land in partitions that are already registered, then there is no need to run a crawler.

As an alternative to running a crawler, you can discover and register new partitions by running the Athena command MSCK REPAIR TABLE <table>, or you can register them manually.
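The two non-crawler options can be sketched in Python as follows. This is only an illustration: the table name `logs`, the bucket `my-bucket`, and both helper functions are hypothetical, and the partition keys are assumed to be string-typed. The helpers just build the Athena statements; how you submit them (Athena console, boto3's `start_query_execution`, etc.) is up to you.

```python
def msck_repair_statement(table: str) -> str:
    """Build the statement that asks Athena to scan the table location
    and register any Hive-style (key=value) partitions it finds."""
    return f"MSCK REPAIR TABLE {table}"


def add_partition_statement(table: str, partition: dict, location: str) -> str:
    """Build the statement that registers a single partition manually.

    `partition` maps partition keys to values, e.g. {"year": "2018"}.
    """
    spec = ", ".join(f"{key}='{value}'" for key, value in partition.items())
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION ({spec}) LOCATION '{location}'"
    )


# Hypothetical table and bucket, used only for illustration.
repair_sql = msck_repair_statement("logs")
add_sql = add_partition_statement(
    "logs",
    {"year": "2018", "month": "11", "day": "3"},
    "s3://my-bucket/data/year=2018/month=11/day=3/",
)
print(repair_sql)
print(add_sql)
```

Note that MSCK REPAIR TABLE relies on the key=value folder layout shown above; for other layouts the partitions would have to be added one by one with ALTER TABLE ... ADD PARTITION.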

The easiest way to create a table in the Data Catalog is to run a crawler. But if you know the schema and have the patience to compose a CREATE TABLE query in Athena, or to fill in all the fields via the AWS Glue console, then you can go that way as well.
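For the manual route, a rough sketch of what composing such a CREATE TABLE query might look like, again in Python. Everything specific here is an assumption made for the example: the table name `logs`, the column list, the comma-delimited text format, and the S3 location are hypothetical, and `create_table_statement` is just an illustrative helper.

```python
def create_table_statement(table, columns, partition_keys, location):
    """Build a CREATE EXTERNAL TABLE statement for comma-delimited text files.

    `columns` and `partition_keys` are lists of (name, type) pairs.
    """
    cols = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns)
    parts = ", ".join(f"{name} {dtype}" for name, dtype in partition_keys)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"PARTITIONED BY ({parts})\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        f"LOCATION '{location}'"
    )


# Hypothetical schema and location, used only for illustration.
ddl = create_table_statement(
    "logs",
    columns=[("request_id", "string"), ("status", "int")],
    partition_keys=[("year", "string"), ("month", "string"), ("day", "string")],
    location="s3://my-bucket/data/",
)
print(ddl)
```

A table created this way starts with no partitions registered, so for partitioned data the partitions still need to be added afterwards (by a crawler, MSCK REPAIR TABLE, or manually).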

Yuriy Bondaruk answered Nov 08 '22 23:11