I have thousands of individual JSON files (each corresponding to one table row) stored in S3 under the following path: s3://my-bucket/<date>/dataXX.json
When I create my table in DDL, is it possible to have the data partitioned by the <date> present in the S3 path? (Or at least to add the value in a new column.)
Thanks
Run ALTER TABLE ADD PARTITION. Because the S3 paths are not in Hive format (key=value folder names), you cannot use the MSCK REPAIR TABLE command to add the partitions after you create the table. Instead, add each partition manually with ALTER TABLE ADD PARTITION, giving its S3 location explicitly.
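As a rough sketch of what that looks like for the layout in the question: the table name `my_table`, the partition column `dt`, and the date values below are assumptions for illustration, and the actual folder names depend on how <date> is written in your bucket.

```sql
-- Hypothetical names: my_table, dt. The table must already be declared with
-- PARTITIONED BY (dt string). Each partition value is mapped to its folder by hand.
ALTER TABLE my_table ADD IF NOT EXISTS
  PARTITION (dt = '2020-01-01') LOCATION 's3://my-bucket/2020-01-01/'
  PARTITION (dt = '2020-01-02') LOCATION 's3://my-bucket/2020-01-02/';
```

Several PARTITION clauses can go in one statement, but a new statement is needed whenever a new date folder appears.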
Hive organizes tables into partitions, which divide a table into related parts based on the values of partition columns such as date, city, and department. With partitions, a query can read just a portion of the data instead of scanning the whole table.
The MSCK REPAIR TABLE command scans a file system such as Amazon S3 for Hive compatible partitions that were added to the file system after the table was created. MSCK REPAIR TABLE compares the partitions in the table metadata and the partitions in S3.
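For comparison, here is a sketch of the case where MSCK REPAIR TABLE does work. It assumes the files were stored under Hive-style key=value folders, which is not the layout in the question; `my_table` and `dt` are hypothetical names.

```sql
-- Assumed Hive-compatible layout (NOT the layout from the question):
--   s3://my-bucket/dt=2020-01-01/data01.json
--   s3://my-bucket/dt=2020-01-02/data02.json
-- With folders named like this, the partitions can be discovered automatically:
MSCK REPAIR TABLE my_table;
```

Afterwards, `SHOW PARTITIONS my_table;` (run as a separate statement) lists what was registered.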
It is now possible to do this with partition projection, using storage.location.template. This partitions the table by some part of your S3 path. Be sure NOT to include the partition column in the regular column list, as it is added automatically. There are a lot of options you can look up to adapt this to your date example. I used "id" to show the simplest version I could think of.
CREATE EXTERNAL TABLE `some_table`(
`col1` bigint)
PARTITIONED BY (
`id` string
)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION
's3://path/bucket/'
TBLPROPERTIES (
'has_encrypted_data'='false',
'projection.enabled'='true',
'projection.id.type' = 'injected',
'storage.location.template'='s3://path/bucket/${id}/'
)
official docs: https://docs.amazonaws.cn/en_us/athena/latest/ug/partition-projection-dynamic-id-partitioning.html
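For the date layout in the question, a variant using the `date` projection type might look like the sketch below. The table name `my_table`, the column name `dt`, the `yyyy-MM-dd` format, and the range start are all assumptions; adjust them to match how <date> actually appears in your paths.

```sql
-- Hypothetical table/column names. dt is declared only under PARTITIONED BY,
-- not in the regular column list.
CREATE EXTERNAL TABLE `my_table` (
  `col1` bigint
)
PARTITIONED BY (
  `dt` string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-bucket/'
TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.dt.type' = 'date',
  'projection.dt.format' = 'yyyy-MM-dd',
  'projection.dt.range' = '2020-01-01,NOW',
  'projection.dt.interval' = '1',
  'projection.dt.interval.unit' = 'DAYS',
  'storage.location.template' = 's3://my-bucket/${dt}/'
);
```

A query can then filter on the projected column, for example `SELECT * FROM my_table WHERE dt = '2020-01-15'`. Note that with the `injected` type from the example above, every query must supply the partition value with an equality condition in the WHERE clause, whereas the `date` type also allows range filters.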
It's not necessary to do this manually. Set up a Glue crawler and it will pick up the folder (in the prefix) as a partition, provided all the folders in the path have the same structure and all the data has the same schema.
But it will name the partition partition_0. You can go into "Edit schema" and change the name of this partition to date or whatever you like.
Make sure you go into your Glue crawler and, under "Configuration options", select "Add new columns only". Otherwise the next crawler run will reset the partition name back to partition_0.
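Once the crawler has created the table and the partition column has been renamed, the folder value can be used like any other column. A minimal sketch, assuming a crawled table called `my_crawled_table` and a partition column renamed to `dt` (both hypothetical names):

```sql
-- Hypothetical names: my_crawled_table, dt.
-- Filtering on the partition column limits the scan to the matching S3 folder.
SELECT dt, col1
FROM my_crawled_table
WHERE dt = '2020-01-01'
LIMIT 10;
```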