I am developing a Spark SQL application and I've got a few questions:
I read that Spark-SQL uses a Hive metastore under the covers. Is this true? I'm talking about a pure Spark-SQL application that does not explicitly connect to any Hive installation.
Spark SQL does not use a Hive metastore under the covers (it defaults to an in-memory, non-Hive catalog unless you're in spark-shell, which does the opposite). The default external catalog implementation is controlled by the spark.sql.catalogImplementation internal property and can be one of two possible values: hive or in-memory.
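For a standalone application, here is a minimal sketch of making that choice explicit when building the SparkSession (the app name and local master are placeholders):

import org.apache.spark.sql.SparkSession

// Ask for the in-memory (non-Hive) catalog explicitly; it is also the
// default for a plain Spark SQL application. The property is a static
// configuration, so it must be set before the first SparkSession in the
// JVM is created.
val spark = SparkSession.builder()
  .appName("catalog-demo")   // hypothetical app name
  .master("local[*]")        // local mode, just for the sketch
  .config("spark.sql.catalogImplementation", "in-memory")
  .getOrCreate()

// For the Hive-backed catalog, replace the .config(...) line with
// .enableHiveSupport(), which sets the property to "hive" and requires
// the spark-hive module on the classpath.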
Use the SparkSession to know what catalog is in use:
scala> :type spark
org.apache.spark.sql.SparkSession
scala> spark.version
res0: String = 2.4.0
scala> :type spark.sharedState.externalCatalog
org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener
scala> println(spark.sharedState.externalCatalog.unwrapped)
org.apache.spark.sql.hive.HiveExternalCatalog@49d5b651
Please note that I used spark-shell, which starts a Hive-aware SparkSession, so I had to start it with --conf spark.sql.catalogImplementation=in-memory to turn it off.
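For example, such a session would look roughly like this (a sketch, not a verbatim transcript; the object hash is elided and the exact output may differ by Spark version):

$ spark-shell --conf spark.sql.catalogImplementation=in-memory

scala> println(spark.sharedState.externalCatalog.unwrapped)
org.apache.spark.sql.catalyst.catalog.InMemoryCatalog@...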
I am starting a Spark-SQL application and have no need to use Hive. Is there any reason to use Hive? From what I understand, Spark-SQL is much faster than Hive, so I don't see any reason to use Hive.
That's a very interesting question, and it can have different answers (some even primarily opinion-based, so we have to be extra careful and follow the Stack Overflow rules).
Is there any reason to use Hive?
No.
But... if you want to use the very recent feature of Spark 2.2, i.e. the cost-based optimizer, you may want to consider it: running ANALYZE TABLE for cost statistics can be fairly expensive, so doing it once for tables that are used over and over again across different Spark application runs could give a performance boost.

Please note that Spark SQL without Hive can do this too, but it has a limitation: the local default metastore is just for single-user access, and reusing the metadata across Spark applications submitted at the same time won't work.
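As a sketch of what that looks like (the table name sales and the column names are hypothetical), the statistics are computed once and then picked up by the cost-based optimizer when it is enabled:

// Compute table-level statistics once; with a Hive metastore they survive
// across application runs (the table name is hypothetical).
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS")

// Optionally compute per-column statistics for better estimates.
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS id, amount")

// Enable the cost-based optimizer (off by default in Spark 2.2).
spark.conf.set("spark.sql.cbo.enabled", "true")

// The collected statistics show up in the table's extended description.
spark.sql("DESCRIBE EXTENDED sales").show(truncate = false)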
I don't see any reason to use Hive.
I wrote a blog post, Why is Spark SQL so obsessed with Hive?! (after just a single day with Hive), where I asked a similar question, and to my surprise it's only now (almost a year after I published that post on Apr 9, 2016) that I think I may have understood why the concept of a Hive metastore is so important, especially in multi-user Spark notebook environments.
Hive itself is just a data warehouse on HDFS, so it's not of much use if you've got Spark SQL, but there are still some concepts Hive has done fairly well that are of much use in Spark SQL (until it fully stands on its own legs with a Hive-like metastore).