
How to sort (order by) big data with hive efficiently?

I want to sort a big dataset efficiently (i.e. with a custom partitioner, as described in "How does the MapReduce sort algorithm work?"), and I want to do it with Hive.

However, the Hive manual states that "order by" is performed by a single reducer. This surprises me, because Pig does implement something similar to the approach in that article (see the Pig implementation).

Am I missing something, or is Hive simply not the right hammer for this job?

ihadanny asked Sep 04 '25 01:09



2 Answers

I think that Hive is not the right tool for this job, at least for now. It is built to be used as an OLAP/reporting tool and is therefore not optimized to produce large result sets, since most analytical queries produce relatively small ones. As a result, it has good top-N capability but no efficient total ordering.

In case you haven't come across it before, I suggest looking into Hadoop's terasort example, which is specifically aimed at sorting a large dataset as efficiently as possible using MapReduce: http://hadoop.apache.org/common/docs/r0.20.1/api/org/apache/hadoop/examples/terasort/package-summary.html
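To make the idea concrete, here is a minimal single-process sketch in Python of sampled range partitioning, which is the general technique behind this kind of parallel total-order sort. It is not the terasort code itself: the function names (`choose_split_points`, `partition_index`, `total_order_sort`) and the default parameters are made up for illustration, and the "partitions" are plain lists standing in for reducers.

```python
import bisect
import random


def choose_split_points(sample, num_partitions):
    """Pick num_partitions - 1 cut points from a sample of the keys."""
    ordered = sorted(sample)
    step = len(ordered) / num_partitions
    return [ordered[int(step * i)] for i in range(1, num_partitions)]


def partition_index(key, split_points):
    """Range partitioner: count how many cut points are <= key."""
    return bisect.bisect_right(split_points, key)


def total_order_sort(records, num_partitions=4, sample_size=100, seed=0):
    """Sort by routing keys to range partitions, then sorting each one."""
    records = list(records)
    if not records:
        return []

    # Sample the input to estimate the key distribution (hypothetical sizes).
    rng = random.Random(seed)
    sample = rng.sample(records, min(sample_size, len(records)))
    split_points = choose_split_points(sample, num_partitions)

    # "Map" side: every key goes to the partition that owns its range.
    partitions = [[] for _ in range(num_partitions)]
    for key in records:
        partitions[partition_index(key, split_points)].append(key)

    # "Reduce" side: each partition is sorted independently. Because the
    # ranges do not overlap, concatenating them in order is globally sorted.
    result = []
    for part in partitions:
        result.extend(sorted(part))
    return result
```

The point of the sampling step is load balance: if the cut points roughly follow the key distribution, each reducer gets a similar share of the data, and no final merge through a single reducer is needed.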

David Gruzman answered Sep 07 '25 10:09



It is not possible to use multiple reducers for total ordering in Hive. It has not been implemented yet; see https://issues.apache.org/jira/browse/HIVE-1402.

If you want efficient total ordering, it will be easier to use Pig than to write a custom MR job.

Thejas Nair answered Sep 07 '25 12:09
