I'm in the process of improving the performance of a table.
Say this table:
CREATE TABLE user_info_bucketed(user_id BIGINT, firstname STRING, lastname STRING)
COMMENT 'A bucketed copy of user_info'
PARTITIONED BY(Year int, month int)
STORED AS PARQUET;
I'm planning to apply bucketing by user_id, as the queries usually involve user_id as a clause.
like this
CREATE TABLE user_info_bucketed(user_id BIGINT, firstname STRING, lastname STRING)
COMMENT 'A bucketed copy of user_info'
PARTITIONED BY(Year int, month int)
CLUSTERED BY(user_id) INTO 256 BUCKETS
STORED AS PARQUET;
This table will be created and loaded with Hive, and queried from Impala...
What i wanted to know is, whether bucketing this table will improve the performance of impala queries - I'm not sure how impala works with buckets.
I tried creating a bucketed and non-bucketed table table through Hive (which is a table 6GB in size)
I tried benchmarking the results from both. There is slight/no difference.
I also tried analyzing the profile of both the queries, which didn't show much difference.
So the answer is, Impala doesn't know whether a table is bucketed or not, so it doesn't take advantage of it (IMPALA-1990). The only way it becomes aware of the partitions and files in the table is with COMPUTE STATS
By the way, bucketing the tables used by Impala is not wasteful. If we have to limit the number of small files in the table, we can bucket it and switch on Hive transactions (available from Hive 0.13.0)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With