HIVE: How does 'LIMIT' on 'SELECT * from' work under-the-hood?

Tags:

memory

limit

hadoop

hive

How does LIMIT work for the following simple query?

select * from T limit 100

Imagine table T has 13 million records.

Will the above query:
1. First load all 13 million records into memory and display only 100 of them in the result set?
2. Load only 100 records and return a result set of 100 records?

I have been searching for this for quite some time; most pages only talk about how to use LIMIT, not how Hive deals with it under the hood.

Any useful response is appreciated.

asked Sep 25 '17 17:09

Alekhya Vemavarapu

1 Answer

If no optimization is applied, Hive ends up scanning the entire table. However, Hive optimizes this case with hive.fetch.task.conversion, introduced as part of HIVE-2925, which lets simple queries with simple conditions run as a fetch task without launching MR/Tez at all.

Supported values are none, minimal and more.

none: Disable hive.fetch.task.conversion (value added in Hive 0.14.0 with HIVE-8389)

minimal: SELECT *, FILTER on partition columns (WHERE and HAVING clauses), LIMIT only

more: SELECT, FILTER, LIMIT only (including TABLESAMPLE, virtual columns)
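
As a rough illustration of the values listed above, the setting can be changed per session before running the query. This is only a sketch: it assumes a Hive CLI or Beeline session and the table T from the question, and whether the fetch path is actually taken can also depend on the Hive version and other settings.

-- Allow simple SELECT / FILTER / LIMIT queries to run as a fetch task
SET hive.fetch.task.conversion=more;

-- Expected to be served by a fetch task, without launching an MR/Tez job
SELECT * FROM T LIMIT 100;

-- Disable the conversion to compare behaviour on the same query
SET hive.fetch.task.conversion=none;
SELECT * FROM T LIMIT 100;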

Your question is most likely about what happens when minimal or more is set. In that case Hive just scans through the added files and reads rows until it reaches leastRows(). For more, refer to gitCode, Config and here.
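
One way to check which path a given query takes is to look at its plan. A minimal sketch, assuming the table T from the question and a session where the property can be set:

-- Enable fetch task conversion for this session
SET hive.fetch.task.conversion=more;

-- Show the execution plan instead of running the query
EXPLAIN SELECT * FROM T LIMIT 100;

If the conversion applies, the plan would typically contain only a fetch stage (a Fetch Operator carrying the limit) and no map-reduce or Tez stage. The exact output format varies across Hive versions.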

answered Oct 25 '22 02:10

rbyndoor
