I am developing a Spark SQL application and I've got a few questions:
I read that Spark-SQL uses a Hive metastore under the covers. Is this true? I'm talking about a pure Spark-SQL application that does not explicitly connect to any Hive installation.
Spark SQL does not use a Hive metastore under the covers (it defaults to an in-memory, non-Hive catalog unless you're in spark-shell, which does the opposite). The default external catalog implementation is controlled by the spark.sql.catalogImplementation internal property and can be one of two possible values: hive or in-memory.
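For a standalone application, here is a minimal sketch of making that choice explicit when building the SparkSession (the app name and local master are placeholders):

import org.apache.spark.sql.SparkSession

// Ask for the in-memory (non-Hive) catalog explicitly; it is also the
// default for a plain Spark SQL application. The property is a static
// configuration, so it must be set before the first SparkSession in the
// JVM is created.
val spark = SparkSession.builder()
  .appName("catalog-demo")   // hypothetical app name
  .master("local[*]")        // local mode, just for the sketch
  .config("spark.sql.catalogImplementation", "in-memory")
  .getOrCreate()

// For the Hive-backed catalog, replace the .config(...) line with
// .enableHiveSupport(), which sets the property to "hive" and requires
// the spark-hive module on the classpath.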
Use the SparkSession to know what catalog is in use:
scala> :type spark
org.apache.spark.sql.SparkSession
scala> spark.version
res0: String = 2.4.0
scala> :type spark.sharedState.externalCatalog
org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener
scala> println(spark.sharedState.externalCatalog.unwrapped)
org.apache.spark.sql.hive.HiveExternalCatalog@49d5b651
Please note that I used spark-shell, which starts a Hive-aware SparkSession, so I had to start it with --conf spark.sql.catalogImplementation=in-memory to turn it off.
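For example, such a session would look roughly like this (a sketch, not a verbatim transcript; the object hash is elided and the exact output may differ by Spark version):

$ spark-shell --conf spark.sql.catalogImplementation=in-memory

scala> println(spark.sharedState.externalCatalog.unwrapped)
org.apache.spark.sql.catalyst.catalog.InMemoryCatalog@...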
I am starting a Spark-SQL application and have no need to use Hive. Is there any reason to use Hive? From what I understand, Spark-SQL is much faster than Hive, so I don't see any reason to use Hive.
That's a very interesting question, and it can have different answers (some even primarily opinion-based, so we have to be extra careful and follow the Stack Overflow rules).
Is there any reason to use Hive?
No.
But... if you want to use the very recent feature of Spark 2.2, i.e. the cost-based optimizer, you may want to consider it: running ANALYZE TABLE for cost statistics can be fairly expensive, so doing it once for tables that are used over and over again across different Spark application runs could give a performance boost.

Please note that Spark SQL without Hive can do this too, but it has a limitation: the local default metastore is just for single-user access, and reusing the metadata across Spark applications submitted at the same time won't work.
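As a sketch of what that looks like (the table name sales and the column names are hypothetical), the statistics are computed once and then picked up by the cost-based optimizer when it is enabled:

// Compute table-level statistics once; with a Hive metastore they survive
// across application runs (the table name is hypothetical).
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS")

// Optionally compute per-column statistics for better estimates.
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS id, amount")

// Enable the cost-based optimizer (off by default in Spark 2.2).
spark.conf.set("spark.sql.cbo.enabled", "true")

// The collected statistics show up in the table's extended description.
spark.sql("DESCRIBE EXTENDED sales").show(truncate = false)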
I don't see any reason to use Hive.
I wrote a blog post, Why is Spark SQL so obsessed with Hive?! (after just a single day with Hive), where I asked a similar question, and to my surprise it's only now (almost a year after I published that post on Apr 9, 2016) that I think I may have understood why the concept of a Hive metastore is so important, especially in multi-user Spark notebook environments.
Hive itself is just a data warehouse on HDFS, so it's not of much use if you've got Spark SQL, but there are still some concepts Hive has done fairly well that are of much use in Spark SQL (until it fully stands on its own legs with a Hive-like metastore).