Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is LIMIT clause in HIVE really random?

The documentation of HIVE notes that LIMIT clause returns rows chosen at random. I have been running a SELECT table on a table with more than 800,000 records with LIMIT 1, but it always return me the same record.

I'm using the Shark distribution, and I am wondering whether this has got anything to do with this not expected behavior? Any thoughts would be appreciated.

Thanks, Visakh

like image 960
visakh Avatar asked May 22 '14 08:05

visakh


People also ask

How does LIMIT work in Hive?

LIMIT takes one or two numeric arguments, which must both be non-negative integer constants. The first argument specifies the offset of the first row to return (as of Hive 2.0. 0) and the second specifies the maximum number of rows to return.

Is LIMIT in SQL random?

The LIMIT clause picks the first row in the result set sorted randomly.

Which clause is used to LIMIT the number of rows in Hive?

The LIMIT clause is used to set a ceiling on the number of rows in the result set.


1 Answers

Even though the documentation states it returns rows at random, it's not actually true.

It returns "chosen rows at random" as it appears in the database without any where/order by clause. This means that it's not really random (or randomly chosen) as you would think, just that the order the rows are returned in can't be determined.

As soon as you slap a order by x DESC limit 5 on there, it returns the last 5 rows of whatever you're selecting from.

To get rows returned at random, you would need to use something like: order by rand() LIMIT 1

However it can have a speed impact if your indexes aren't setup properly. Usually I do a min/max to get the ID's on the table, and then do a random number between them, then select those records (in your case, would be just 1 record), which tends to be faster than having the database do the work, especially on a large dataset

like image 161
user3036342 Avatar answered Sep 20 '22 08:09

user3036342