How to sort (order by) big data with Hive efficiently?

Question

I want to sort a big dataset efficiently (i.e. with a custom partitioner, as described in "How does the MapReduce sort algorithm work?"), but I want to do it with Hive.

However, the Hive manual states that "order by" is performed by a single reducer. This surprises me, since Pig does implement something similar to the article (pig impl).

Am I missing something, or is it that Hive simply isn't the right hammer for this job?

David Gruzman · Accepted Answer

I think Hive is not the right tool for the job, at least for now. It is built to be used as an OLAP/reporting tool and is therefore not optimized to produce large result datasets, since most analytical queries produce relatively small result sets. As a result, it has good TOP N capability but not good total ordering.
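To illustrate why a top-N query parallelizes more easily than a total order, here is a minimal sketch in plain Python. It is not Hive's actual implementation; the function names and the sample data are made up for illustration. Each worker keeps only its own N largest records, so the final single-process merge sees at most N records per partition, however large the input is. A total order has no such shortcut, because every record must appear in the output.

```python
import heapq


def local_top_n(partition, n):
    """Hypothetical per-worker step: keep only the n largest records."""
    return heapq.nlargest(n, partition)


def global_top_n(partitions, n):
    """Hypothetical final step: merge the small per-partition candidates.

    The merge handles at most n * len(partitions) records, so running it
    in one process is cheap even when the partitions themselves are huge.
    """
    candidates = [x for part in partitions for x in local_top_n(part, n)]
    return heapq.nlargest(n, candidates)


if __name__ == "__main__":
    partitions = [[5, 1, 9], [7, 3, 8], [2, 6, 4]]
    print(global_top_n(partitions, 2))  # [9, 8]
```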

In case you haven't encountered it before, I suggest looking into Hadoop's terasort example, which is specifically aimed at sorting a large dataset in the best possible way using MR. http://hadoop.apache.org/common/docs/r0.20.1/api/org/apache/hadoop/examples/terasort/package-summary.html
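For readers who want the idea without reading the Hadoop source, here is a rough sketch of the range-partitioning approach that a sampled total-order sort relies on. It is plain Python rather than the terasort code, and every name in it (`choose_split_points`, `partition_for`, `total_order_sort`) is hypothetical. The assumption is the usual one: sample the keys, pick boundary keys from the sample, send each record to the bucket whose range contains its key, and sort the buckets independently. Concatenating the buckets in order then gives a globally sorted result.

```python
import bisect
import random


def choose_split_points(sample, num_partitions):
    """Pick num_partitions - 1 boundary keys from a sample of the input."""
    ordered = sorted(sample)
    if not ordered:
        return []
    step = len(ordered) / num_partitions
    return [ordered[int(step * i)] for i in range(1, num_partitions)]


def partition_for(key, split_points):
    """Binary-search the boundaries to find the bucket for this key."""
    return bisect.bisect_right(split_points, key)


def total_order_sort(records, num_partitions, sample_size=100, seed=0):
    """Return one sorted list per bucket; buckets are in global key order."""
    rng = random.Random(seed)
    sample = rng.sample(records, min(sample_size, len(records)))
    split_points = choose_split_points(sample, num_partitions)

    buckets = [[] for _ in range(num_partitions)]
    for key in records:
        buckets[partition_for(key, split_points)].append(key)

    # In MapReduce each bucket would be sorted by its own reducer in parallel.
    return [sorted(bucket) for bucket in buckets]


if __name__ == "__main__":
    data = [random.randint(0, 10_000) for _ in range(1_000)]
    parts = total_order_sort(data, num_partitions=4)
    merged = [key for part in parts for key in part]
    print(merged == sorted(data))  # True
```

How evenly the work is spread depends on how well the sample reflects the real key distribution: a skewed sample leads to uneven buckets, and the largest bucket determines how long the sort takes.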

Thejas Nair · Answer

It is not possible to use multiple reducers for total ordering in Hive. It has not been implemented yet: https://issues.apache.org/jira/browse/HIVE-1402

If you want efficient total ordering, it will be easier to use Pig than to write a custom MR job.

Tags:

hadoop

hive

apache-pig

mapreduce

ihadanny
