Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

AWS update Athena meta: Glue Crawler vs MSCK Repair Table

When new partition is added to an Athena table, we could use either Glue Crawler or MSCK REPAIR TABLE to update meta info. What are the cost for them? Which one is preferred?

like image 663
Fang Zhang Avatar asked May 23 '19 19:05

Fang Zhang


2 Answers

MSCK REPAIR TABLE command requires your S3 key to include the partition scheme as documented here. If your S3 key does not include the partition scheme, the MSCK REPAIR TABLE command will return missing partitions, but you will still have to add them in. Also one other difference is that the MSCK REPAIR TABLE command can time out after 30 minutes (default Athena query time length) while glue crawler will not.

Here is pricing information:

Glue Crawler:

There is an hourly rate for AWS Glue crawler runtime to discover data and populate the AWS Glue Data Catalog. You are charged an hourly rate based on the number of Data Processing Units (or DPUs) used to run your crawler. A single Data Processing Unit (DPU) provides 4 vCPU and 16 GB of memory. You are billed in increments of 1 second, rounded up to the nearest second, with a 10-minute minimum duration for each crawl. Use of AWS Glue crawlers is optional, and you can populate the AWS Glue Data Catalog directly through the API.

Pricing

For all AWS Regions where AWS Glue is available: $0.44 per DPU-Hour, billed per second, with a 10-minute minimum per crawler run

Athena:

There are no charges for Data Definition Language (DDL) statements like CREATE/ALTER/DROP TABLE, statements for managing partitions, or failed queries.

However, on top of both of these commands you will still incur S3 costs. Reference: AWS Athena: does `msck repair table` incur costs?

My opinion is it is best to manage the partition yourself if you are able to, after adding new data.

'ALTER TABLE database.table ADD
PARTITION (partition_name='PartitionValue') location 's3://bucket/path/partition'

If forced to use Glue or Athena, I would evaluate which way will fit better into your process. The MSCK REPAIR TABLE command may be easier to manage but you may run into trouble if you have a lot of data in partitions or they are not partitioned correctly. Also, you will have to have a way to automate running the command. Glue Crawlers can be configured with triggers.

like image 178
KiteCoder Avatar answered Oct 15 '22 01:10

KiteCoder


I agree with adding partitions manually. You can do this via an Athena query (ALTER TABLE ... ADD PARTITION () ...) as in the answer from @KiteCoder, or you can do this via the Glue API directly.

Calling the Glue API is more verbose, but also more 'structured'. Calling Athena is obviously a SQL query, and I know how many people despise writing code that dynamically generates SQL.

The specific operation is CreatePartition. It does require an object called StorageDescriptor which defines all the columns and data types in that table, but for an existing table you can retrieve that structure from the GetTable operation.

like image 32
Kirk Broadhurst Avatar answered Oct 15 '22 01:10

Kirk Broadhurst