
AWS Athena: use "folder" name as partition

I have thousands of individual json files (corresponding to one Table row) stored in s3 with the following path: s3://my-bucket/<date>/dataXX.json

When I create my table with DDL, is it possible to have the data partitioned by the <date> present in the S3 path (or at least to add that value as a new column)?

Thanks

Raphael asked Mar 01 '17 09:03



2 Answers

This is now possible with partition projection, using the storage.location.template table property to map part of your S3 path to a partition column. Be sure NOT to include the partition column in the regular column list; declaring it in PARTITIONED BY adds it automatically. There are many projection options you can look up to adapt this to your date example. I used "id" to show the simplest version I could think of.

CREATE EXTERNAL TABLE `some_table`(
  `col1` bigint)
PARTITIONED BY (
  `id` string
  )
ROW FORMAT SERDE 
  'org.openx.data.jsonserde.JsonSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION
  's3://path/bucket/'
TBLPROPERTIES (
  'has_encrypted_data'='false',
  'projection.enabled'='true', 
  'projection.id.type' = 'injected',
  'storage.location.template'='s3://path/bucket/${id}/'
  )

official docs: https://docs.amazonaws.cn/en_us/athena/latest/ug/partition-projection-dynamic-id-partitioning.html
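
With the injected type, Athena cannot enumerate the partition values itself, so each query has to supply the value in its WHERE clause (for example WHERE id = 'abc'). For the date-named folders in the question, the date projection type may be a better fit. The following is only a sketch under assumptions that are not in the original post: the table name events_by_date, the partition column name dt, the folder format yyyy-MM-dd, and the range start 2017-01-01 are all hypothetical and need to be adjusted to the real bucket layout. Run each statement separately.

```sql
-- Hypothetical table and column names; assumes folders are named like 2017-03-01.
-- The partition column dt is declared only in PARTITIONED BY, not in the column list.
CREATE EXTERNAL TABLE `events_by_date` (
  `col1` bigint
)
PARTITIONED BY (
  `dt` string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-bucket/'
TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.dt.type' = 'date',
  'projection.dt.format' = 'yyyy-MM-dd',
  'projection.dt.range' = '2017-01-01,NOW',
  'projection.dt.interval' = '1',
  'projection.dt.interval.unit' = 'DAYS',
  'storage.location.template' = 's3://my-bucket/${dt}/'
);

-- Filtering on dt lets Athena read only the matching folder.
SELECT col1, dt
FROM events_by_date
WHERE dt = '2017-03-01'
LIMIT 10;
```

Because the partitions are computed from the table properties, no ALTER TABLE ADD PARTITION or MSCK REPAIR TABLE step should be needed when new date folders appear.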

Jeremy Giaco answered Sep 17 '22 16:09


It's not necessary to do this manually. Set up a Glue crawler and it will pick up the folder (in the prefix) as a partition, provided all the folders in the path have the same structure and all the data has the same schema.

But it will name the partition partition_0. You can go into "Edit schema" and rename this partition to date or whatever you like.

Make sure you go into your Glue crawler and, under "Configuration options", select "Add new columns only". Otherwise the next crawler run will reset the partition name back to partition_0.
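
Once the crawler has created the table and the partition column has been renamed, a query can filter on it like any other column. This is only an illustration: the table name my_json_data, the renamed partition column dt, and the folder value 2017-03-01 are hypothetical and depend on what the crawler actually produced.

```sql
-- Hypothetical names: my_json_data is the crawler-created table,
-- dt is the partition column after renaming it in the Glue schema editor.
SELECT *
FROM my_json_data
WHERE dt = '2017-03-01'
LIMIT 10;
```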

Venkat.V.S answered Sep 21 '22 16:09