Is LIMIT clause in HIVE really random?

Tags: sql, hive, hiveql, shark-sql

The documentation of <code>HIVE</code> notes that the <code>LIMIT</code> clause <code>returns rows chosen at random</code>. I have been running a <code>SELECT</code> query with <code>LIMIT 1</code> on a table with more than <code>800,000</code> records, but it always returns the same record.

I'm using the <code>Shark</code> distribution, and I am wondering whether that has anything to do with this unexpected behavior. Any thoughts would be appreciated.

Thanks, Visakh


asked May 22 '14 08:05

visakh

1 Answer

Even though the documentation states that it returns rows at random, that's not actually true.

What it returns is the rows in whatever order they happen to come out of the database when there is no where/order by clause. This means that it's not really random (or randomly chosen) as you would think, just that the order the rows are returned in isn't guaranteed. In practice, the same query over unchanged data will often give you the same row every time, which is what you're seeing.

As soon as you slap an <code>order by x DESC limit 5</code> on there, it returns the 5 rows with the highest values of <code>x</code>, i.e. the last 5 rows of whatever you're selecting from when sorted by <code>x</code>.

To get rows returned at random, you would need to use something like: <code>order by rand() LIMIT 1</code>
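To make the difference concrete, here is a minimal sketch that runs both kinds of query repeatedly. It uses Python's built-in <code>sqlite3</code> as a stand-in for Hive, which is an assumption for illustration only: the <code>records</code> table and its columns are made up, and SQLite spells the function <code>random()</code> where Hive uses <code>rand()</code>.

```python
import sqlite3

# In-memory SQLite stands in for Hive here; table and column names are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO records (id, payload) VALUES (?, ?)",
    [(i, f"row-{i}") for i in range(1, 1001)],
)


def first_id(sql):
    """Run a single-row query and return the id it yields."""
    return conn.execute(sql).fetchone()[0]


# Bare LIMIT: the engine returns whichever row its scan reaches first.
plain_picks = [first_id("SELECT id FROM records LIMIT 1") for _ in range(20)]

# Explicit random ordering (SQLite: random(); Hive: rand()).
random_picks = [
    first_id("SELECT id FROM records ORDER BY random() LIMIT 1")
    for _ in range(20)
]

print("distinct ids from bare LIMIT 1:     ", len(set(plain_picks)))
print("distinct ids from ORDER BY random():", len(set(random_picks)))
```

The bare <code>LIMIT 1</code> keeps landing on the same row because nothing changes between runs, while the randomly ordered query moves around the table. Other engines may scan in a different order, but the point stands: unspecified order is not the same thing as random order.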

However, it can have a speed impact if your indexes aren't set up properly, since every row has to be given a random value and sorted. Usually I do a min/max to get the range of IDs on the table, generate a random number between them, and then select the matching records (in your case, just 1 record). That tends to be faster than having the database do the work, especially on a large dataset.
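The min/max approach could look roughly like the sketch below. It is not the exact code I use, just an illustration under some assumptions: the table has a numeric <code>id</code> column, SQLite again stands in for Hive, and <code>pick_random_row</code> is a hypothetical helper name. Because IDs may have gaps, the lookup takes the first row at or above the random number instead of requiring an exact match.

```python
import random
import sqlite3


def pick_random_row(conn, table, id_col="id"):
    """Return one row chosen via a random id between MIN and MAX, or None if empty.

    table and id_col are interpolated into the SQL, so they must be trusted names.
    """
    lo, hi = conn.execute(
        f"SELECT MIN({id_col}), MAX({id_col}) FROM {table}"
    ).fetchone()
    if lo is None:  # empty table: MIN/MAX come back as NULL
        return None
    target = random.randint(lo, hi)  # inclusive on both ends
    # ">=" with ORDER BY/LIMIT steps over gaps left by deleted or sparse ids.
    return conn.execute(
        f"SELECT * FROM {table} WHERE {id_col} >= ? ORDER BY {id_col} LIMIT 1",
        (target,),
    ).fetchone()


# Hypothetical demo data with gaps between ids: 10, 20, ..., 990.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO records VALUES (?, ?)",
    [(i, f"row-{i}") for i in range(10, 1000, 10)],
)

row = pick_random_row(conn, "records")
print(row)
```

One caveat: when the IDs have uneven gaps, the pick is not perfectly uniform, because a row that follows a large gap is more likely to be chosen. Whether this actually beats <code>order by rand()</code> on Hive or Shark depends on how the table is stored, which this sketch does not measure.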


answered Sep 20 '22 08:09

user3036342
