How to avoid AWS Athena CTAS query creating small files?

Question

I'm unable to figure out what is wrong with my CTAS query, it breaks the data into smaller files while storing inside a partition even though I haven't mentioned any bucketing columns. Is there a way to avoid these small files and store as one single file per partition as files lesser than 128 MB would cause additional overhead?

CREATE TABLE sampledb.yellow_trip_data_parquet
WITH(
    format = 'PARQUET'
    parquet_compression = 'GZIP',
    external_location='s3://mybucket/Athena/tables/parquet/'
    partitioned_by=ARRAY['year','month']
)
AS SELECT
    VendorID,
    tpep_pickup_datetime,
    tpep_dropoff_datetime,
    passenger_count,
    trip_distance,
    RatecodeID,
    store_and_fwd_flag,
    PULocationID,
    DOLocationID,
    payment_type,
    fare_amount,
    extra,
    mta_tax,
    tip_amount,
    tolls_amount,
    improvement_surcharge,
    total_amount,
    date_format(date_parse(tpep_pickup_datetime,'%Y-%c-%d %k:%i:%s'),'%Y')  AS year,
    date_format(date_parse(tpep_pickup_datetime,'%Y-%c-%d %k:%i:%s'),'%c')  AS month
FROM sampleDB.yellow_trip_data_raw;

image from my partition

jon · Accepted Answer

I was able to overcome the issue by creating a bucketing column month_a. Below is the code

CREATE TABLE sampledb.yellow_trip_data_avro
WITH (
    format = 'AVRO',
    external_location='s3://a4189e1npss3001/Athena/internal_tables/avro/',
    partitioned_by=ARRAY['year','month'],
    bucketed_by=ARRAY['month_a'],
    bucket_count=12
) AS SELECT
    VendorID,
    tpep_pickup_datetime,
    tpep_dropoff_datetime,
    passenger_count,
    trip_distance,
    RatecodeID,
    store_and_fwd_flag,
    PULocationID,
    DOLocationID,
    payment_type,
    fare_amount,
    extra,
    mta_tax,
    tip_amount,
    tolls_amount,
    improvement_surcharge,
    total_amount,
    date_format(date_parse(tpep_pickup_datetime, '%Y-%c-%d %k:%i:%s'),'%c') AS month_a,
    date_format(date_parse(tpep_pickup_datetime, '%Y-%c-%d %k:%i:%s'),'%Y') AS year,
    date_format(date_parse(tpep_pickup_datetime, '%Y-%c-%d %k:%i:%s'),'%c') AS month
FROM sampleDB.yellow_trip_data_raw;

How to avoid AWS Athena CTAS query creating small files?

Tags:

amazon-web-services

amazon-athena

jon

1 Answers

jon

Recent Activity

Donate For Us

How to avoid AWS Athena CTAS query creating small files?

Tags:

amazon-web-services

amazon-athena

jon

1 Answers

jon

Related questions

Recent Activity

Donate For Us