Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Hive vs SQL Server performance

1) I started using hive from last 2 months. I have a same task as that in SQL. I found that Hive is slow and takes more time to execute queries while SQL executes it in very few minutes/seconds.

After executing the task in Hive when I cross check the result in both (SQL and Hive), I found some difference in results (Not all but in some tables). e.g. : I have one table which has 2012 records, when I executed a task in Hive in the same table in Hive I got 2007 records.

Why it is happening?

2) If I think to speed up my execution in Hive then what should I do for it? (Currently I am executing all this stuff on single cluster only. If I think to increase the clusters then how many cluster should I need it to increase the performance)

Please suggest me some solution or some good practices so that I can do it keenly.

Thanks.

like image 960
Bhavesh Shah Avatar asked Apr 03 '12 06:04

Bhavesh Shah


2 Answers

  • Hive and SQL Server are not comparable in any way other than the similarity in the syntax of the query language.
  • While SQL Server is built to be able to respond in realtime from a single machine, hive is for processing large data sets that may span hundreds or thousands of machines.
  • Hive (via hadoop) has a lot of overhead for starting up a job.
  • Hive and hadoop will not cache data in memory like sql server does.
  • Hive has only recent added indexes so most queries end up being a table scan.
  • If your dataset fits on a single computer you probably want to stick with SQL Server and not hive. Hive performance tuning is mostly based in Hadoop performance tuning although depending on the types of queries you run there can be free performance from using the LazyBinarySerDe.

Hive does have some differences from regular SQL that may be effecting your query. Without more details I can't speculate as to why.

like image 181
Steve Severance Avatar answered Nov 17 '22 17:11

Steve Severance


Ignore the "they aren't comparable in any way" comment. If it stores data, it is comparable to any other method of storing data.

But be aware that SQL Server, 13 years ago, had 1000+ people being paid full-time to improve their product. So while that doesn't "Prove" anything, it does increase ones confidence that more work = more results.

More importantly, look for any non-trivial benchmark done on an open source and/or non-relational method of storing data vs one of the mainstream relational databases. You won't find them. That says a lot to me. (Also, mainstream isn't necessary since the current world's fastest data engine isn't even mainstream. But if that level is needed, look at ExoSol.)

If your need is to learn to work with technology at your job and that technology is Hive, my recommendation is to find someone who is really focused on getting the most out of Hive query performance as possible. If there is a Hive query guru out there, find them. But if you need a lot more than what they can give you, you're using the wrong technology.

And if Hive isn't a requirement, I would avoid it and other technologies lacking the compelling business model that will guarantee their survival past 5 years and move them out of niche category they currently exist in (currently 20 times less popular than any mainstream data engine - https://db-engines.com/en/ranking).

like image 1
Jeff Winchell Avatar answered Nov 17 '22 17:11

Jeff Winchell