Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why is Select Count(*) slower than Select * in hive

Tags:

When i am running queries in VirtualBox Sandbox with hive. I feel Select count(*) is too much slower than the Select *.

Can anyone explain what is going on behind?

And why this delay is happening?

like image 865
Haris N I Avatar asked Jun 12 '13 08:06

Haris N I


People also ask

Why is select * from Table query faster than select count (*) from table in Hive?

This is because the DB is using clustered primary keys so the query searches each row for the key individually, row by agonizing row, not from an index. Run optimize table . This will ensure that the data pages are physically stored in sorted order.

What is the difference between select * and select count (*)?

Select * Would return the entire table while Select Count(*) would return the number of rows.

How do you make the select count faster?

So to make SELECT COUNT(*) queries fast, here's what to do: Get on any version that supports batch mode on columnstore indexes, and put a columnstore index on the table – although your experiences are going to vary dramatically depending on the kind of query you have.

How do I make Hive query run faster?

Performance tuning is key to optimizing a Hive query. First, tweak your data through partitioning, bucketing, compression, etc. Improving the execution of a hive query is another Hive query optimization technique. You can do this by using Tez, avoiding skew, and increasing parallel execution.


2 Answers

select * from table

It can be a Map only job But

Select Count(*) from table 

It can be a Map and Reduce job

Hope this helps.

like image 157
Mask Avatar answered Sep 19 '22 04:09

Mask


There are three types of operations that a hive query can perform.

In order of cheapest and fastest to more expensive and slower here they are.

A hive query can be a metadata only request.

Show tables, describe table are examples. In these queries the hive process performs a lookup in the metadata server. The metadata server is a SQL database, probably MySQL, but the actual DB is configurable.

A hive query can be an hdfs get request. Select * from table, would be an example. In this case hive can return the results by performing an hdfs operation. hadoop fs -get, more or less.

A hive query can be a Map Reduce job.

Hive has to ship the jar to hdfs, the jobtracker queues the tasks, the tasktracker execute the tasks, the final data is put into hdfs or shipped to the client.

The Map Reduce job has different possibilities as well.

It can be a Map only job. Select * from table where id > 100 , for example all of that logic can be applied on the mapper.

It can be a Map and Reduce job, Select min(id) from table; Select * from table order by id ;

It can also lead to multiple map Reduce passes, but I think the above summarizes some behaviors.

like image 25
Pearl90 Avatar answered Sep 22 '22 04:09

Pearl90