I am new to Spark. I found that using HiveContext we can connect to Hive and run HiveQL queries. I ran it and it worked.
My doubt is whether Spark does this through Spark jobs. That is, does it use HiveContext only to access the corresponding Hive table files from HDFS, or does it internally call Hive to execute the query?
public class HiveContext extends SQLContext implements Logging. An instance of the Spark SQL execution engine that integrates with data stored in Hive. Configuration for Hive is read from hive-site.xml on the classpath.
HiveContext is a superset of SQLContext. Additional features include the ability to write queries using the more complete HiveQL parser, access to Hive UDFs, and the ability to read data from Hive tables. And if you want to work with Hive, you have to use HiveContext.
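As a minimal sketch (assuming a Spark 1.x setup and a hypothetical Hive table named sales), this is what running a HiveQL query through HiveContext looks like:

    // Build a HiveContext on top of an existing SparkContext (Spark 1.x API).
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val conf = new SparkConf().setAppName("HiveQLExample")
    val sc = new SparkContext(conf)
    val hiveContext = new HiveContext(sc)

    // Run a HiveQL query; the result comes back as a Spark DataFrame.
    val df = hiveContext.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region")
    df.show()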
Spark SQL also supports reading and writing data stored in Apache Hive. However, since Hive has a large number of dependencies, these dependencies are not included in the default Spark distribution. If Hive dependencies can be found on the classpath, Spark will load them automatically.
Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine. It enables unmodified Hadoop Hive queries to run up to 100x faster on existing deployments and data.
No, Spark doesn't call Hive to execute the query. Spark only reads the metadata from Hive and executes the query within the Spark engine. Spark has its own SQL execution engine, which includes components such as Catalyst and Tungsten to optimize queries and give faster results. It uses the metadata from Hive and Spark's execution engine to run the queries.
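You can check this yourself: calling explain() on the result of a HiveQL query prints the plan that Catalyst produced and that Spark's own engine will run; there is no Hive or MapReduce stage in it (a sketch reusing the hypothetical hiveContext and sales table from above):

    // The plan is built and optimized by Catalyst and executed by Spark itself;
    // nothing is handed back to Hive for execution.
    hiveContext.sql("SELECT region, SUM(amount) FROM sales GROUP BY region").explain(true)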
One of the greatest advantages of Hive is its metastore. It acts as a single metastore for a lot of components in the Hadoop ecosystem.
Coming to your question, when you use HiveContext, it gets access to the metastore database and all your Hive metadata, which describes what type of data you have, where the data is located, the serializers and deserializers, compression codecs, columns, data types, and literally every detail about the table and its data. That is enough for Spark to understand the data.
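As a rough illustration (again assuming the hypothetical sales table), the details Spark picks up from the metastore are immediately visible:

    // Column names and data types come straight from the Hive metastore.
    hiveContext.table("sales").printSchema()

    // Table-level details (location, SerDe, compression, etc.) can be inspected with HiveQL.
    hiveContext.sql("DESCRIBE FORMATTED sales").show(100, false)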
Overall, Spark only needs the metastore, which gives complete details of the underlying data, and once it has the metadata, it executes the queries you ask for over its own execution engine. Hive is slower than Spark, as it uses MapReduce. So there is no point in going back to Hive and asking it to run the query.
Let me know if this answers your question.