After reading What is hive, Is it a database?, a colleague yesterday mentioned that he was able to filter a 15B table, join it with another table after doing a "group by", which resulted in 6B records, in only 10 minutes! I wonder if this would be slower in Spark, since now with the DataFrames, they may be comparable, but I am not sure, thus the question.
Is Hive faster than Spark? Or this question doesn't have meaning? Sorry, for my ignorance.
He uses the latest Hive, which from seems to be using Tez.
Hive is the best option for performing data analytics on large volumes of data using SQLs. Spark, on the other hand, is the best option for running big data analytics. It provides a faster, more modern alternative to MapReduce.
What is Apache Spark? Apache Spark — which is also open source — is a data processing engine for big data sets. Like Hadoop, Spark splits up large tasks across different nodes. However, it tends to perform faster than Hadoop and it uses random access memory (RAM) to cache and process data instead of a file system.
Furthermore, Apache Hive provides a similar interface to receive data in Hadoop cluster. Hence, it is a great way to start faster data analyzing. Let's look into the technical aspects that make Hive faster during the processing of queries.
Spark is more for mainstream developers, while Tez is a framework for purpose-built tools. Spark can't run concurrently with YARN applications (yet). Tez is purposefully built to execute on top of YARN. Tez's containers can shut down when finished to save resources.
Hive is just a framework that gives sql functionality to MapReduce type workloads.
These workloads can run on mapreduce or yarn.
So comparing Hive on tez vs Hive on spark. Nice article below discussing this When to go with ETL on Hive using Tez VS When to go with Spark ETL? (Gist use Hive on spark if not sure).
Lower the better
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With