Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Does Impala makes effective use of Buckets in a Hive Bucketed table?

I'm in the process of improving the performance of a table.

Say this table:

CREATE TABLE user_info_bucketed(user_id BIGINT, firstname STRING, lastname STRING)
COMMENT 'A bucketed copy of user_info'
PARTITIONED BY(Year int, month int)
STORED AS PARQUET;

I'm planning to apply bucketing by user_id, as the queries usually involve user_id as a clause.

like this

CREATE TABLE user_info_bucketed(user_id BIGINT, firstname STRING, lastname STRING)
COMMENT 'A bucketed copy of user_info'
PARTITIONED BY(Year int, month int)
CLUSTERED BY(user_id) INTO 256 BUCKETS
STORED AS PARQUET;

This table will be created and loaded with Hive, and queried from Impala...

What i wanted to know is, whether bucketing this table will improve the performance of impala queries - I'm not sure how impala works with buckets.

like image 719
Renganathan Mayavan Avatar asked Nov 01 '22 05:11

Renganathan Mayavan


1 Answers

I tried creating a bucketed and non-bucketed table table through Hive (which is a table 6GB in size)

I tried benchmarking the results from both. There is slight/no difference.

I also tried analyzing the profile of both the queries, which didn't show much difference.

So the answer is, Impala doesn't know whether a table is bucketed or not, so it doesn't take advantage of it (IMPALA-1990). The only way it becomes aware of the partitions and files in the table is with COMPUTE STATS

By the way, bucketing the tables used by Impala is not wasteful. If we have to limit the number of small files in the table, we can bucket it and switch on Hive transactions (available from Hive 0.13.0)

like image 56
4 revs, 3 users 65% Avatar answered Nov 11 '22 04:11

4 revs, 3 users 65%