Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

HIVE: How does 'LIMIT' on 'SELECT * from' work under-the-hood?

Just wondering how does limit work for the following simple query

select * from T limit 100

Imagine table T has 13 million records

Will the above query:
1. first load all 13 million into memory & display only 100 records in the result set ?
2. Loads only 100 & gives the result set of 100 records

Was searching for it for quite some time now, most of the pages only talk about using the "LIMIT" but not how Hive deals with it under the hood.

Any useful response appreciated.

like image 213
Alekhya Vemavarapu Avatar asked Sep 25 '17 17:09

Alekhya Vemavarapu


People also ask

How does limit work in Hive?

LIMIT takes one or two numeric arguments, which must both be non-negative integer constants. The first argument specifies the offset of the first row to return (as of Hive 2.0. 0) and the second specifies the maximum number of rows to return.

Can we use limit with order by in Hive?

LIMIT clause is optional with the ORDER BY clause. LIMIT clause can be used to improve the performance. LIMIT clause can be used to avoid unnecessary data processing.

Will the reducer work or not if you use limit 1 in any HiveQL query?

Reducer will not run if we use limit in select clause.

How do I make my Hive query run faster?

First, tweak your data through partitioning, bucketing, compression, etc. Improving the execution of a hive query is another Hive query optimization technique. You can do this by using Tez, avoiding skew, and increasing parallel execution.


1 Answers

If no optimizer applied, hive end up scanning entire table. But Hive optimizes this with hive.fetch.task.conversion released as part of HIVE-2925, To ease simple queries with simple conditions and not to run MR/Tez at all.

Supported values are none, minimal and more.

none: Disable hive.fetch.task.conversion (value added in Hive 0.14.0 with HIVE-8389)

minimal: SELECT *, FILTER on partition columns (WHERE and HAVING clauses), LIMIT only

more: SELECT, FILTER, LIMIT only (including TABLESAMPLE, virtual columns)

Your question is more likely what happens when minimal or more is set. It just scans through the added files and read rows until reach leastRows() For more refer gitCode, Config and here

like image 84
rbyndoor Avatar answered Oct 25 '22 02:10

rbyndoor