I'm working on a Hadoop cluster (HDP) with Hadoop 3. Spark and Hive are also installed.
Since the Spark and Hive catalogs are separate, it is sometimes confusing to know how and where to save data in a Spark application.
I know that the property spark.sql.catalogImplementation can be set to either in-memory (to use a Spark-session-based catalog) or hive (to use the Hive catalog for persistent metadata storage; the metadata is still kept separate from the Hive DBs and tables, though).
I'm wondering what the property metastore.catalog.default does. When I set this to hive I can see my Hive tables, but since the tables are stored in the /warehouse/tablespace/managed/hive directory in HDFS, my user has no access to this directory (because hive is of course the owner).
So why should I set metastore.catalog.default = hive if I can't access the tables from Spark? Does it have something to do with Hortonworks' Hive Warehouse Connector?
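For reference, this is roughly how I create the session (just a sketch: the app name is made up, in the real job most settings come from spark-defaults.conf, and I'm assuming the spark.hadoop. prefix forwards metastore.catalog.default to the metastore client):

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: app name is made up; in the real job these settings live in spark-defaults.conf.
val spark = SparkSession.builder()
  .appName("catalog-test")
  // equivalent to setting spark.sql.catalogImplementation=hive
  .enableHiveSupport()
  // assumption: the spark.hadoop. prefix forwards this property to the Hive metastore client
  .config("spark.hadoop.metastore.catalog.default", "hive")
  .getOrCreate()

// With metastore.catalog.default=hive this should list the Hive-owned tables as well.
spark.sql("SHOW TABLES IN default").show()
```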
Thank you for your help.
Hive and Spark are both immensely popular tools in the big data world. Hive is well suited to SQL-based analytics on large volumes of data, while Spark is a faster, more modern alternative to MapReduce for general-purpose big data processing.
A Hive metastore warehouse (aka spark-warehouse) is the directory where Spark SQL persists tables whereas a Hive metastore (aka metastore_db) is a relational database to manage the metadata of the persistent relational entities, e.g. databases, tables, columns, partitions.
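As a quick illustration, in spark-shell (where spark is the predefined SparkSession) you can print the warehouse directory the session is using; this is only a sketch, and the actual path depends on your configuration:

```scala
// The warehouse is a directory of data files; the metastore is a database of metadata.
println(spark.conf.get("spark.sql.warehouse.dir"))   // e.g. .../spark-warehouse, or the Hive warehouse directory
```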
Spark SQL does not use a Hive metastore under the covers (and defaults to in-memory, non-Hive catalogs unless you're in spark-shell, which does the opposite). The default external catalog implementation is controlled by the internal property spark.sql.catalogImplementation, which can take one of two values: hive or in-memory.
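For example, in spark-shell you can check which catalog implementation the current session ended up with (a sketch; the second argument is only a fallback in case the property was never set explicitly):

```scala
// Returns "hive" or "in-memory" for the active session.
println(spark.conf.get("spark.sql.catalogImplementation", "in-memory"))
```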
Spark bootstraps a pseudo-Metastore (embedded Derby DB) for internal use, and optionally uses an actual Hive Metastore to read/write persistent Hadoop data. Which does not mean that Spark uses Hive I/O libs, just the Hive meta-data. – Samson Scharfrichter May 9 '17 at 19:19
Apache Spark supports multiple languages. Speed: operations in Hive are slower than in Apache Spark in terms of memory and disk processing, as Hive runs on top of Hadoop. Read/write operations: the number of read/write operations in Hive is greater than in Apache Spark.
There are two catalog implementations: in-memory, to create in-memory tables that are only available in the Spark session, and hive, to create persistent tables using an external Hive Metastore. More details here.
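For illustration, a minimal sketch of the difference (the table name is made up, and spark is your SparkSession): with in-memory the saved table's metadata lives only for the current session, while with hive it is written to the external Hive Metastore and survives the session.

```scala
// Create a managed table through whichever catalog implementation is active.
spark.range(10).write.saveAsTable("demo_numbers")   // hypothetical table name

// With in-memory this listing is session-scoped and disappears when the session ends;
// with hive it reflects (and persists to) the external Hive Metastore.
spark.catalog.listTables("default").show()
```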
Multiple catalogs can coexist in the same Hive Metastore. For example, HDP versions from 3.1.0 to 3.1.4 use a different catalog to save Spark tables and Hive tables.
You may want to set metastore.catalog.default=hive to read Hive external tables using the Spark API. The table location in HDFS must be accessible to the user running the Spark app.
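For example, assuming metastore.catalog.default=hive is in effect (however you pass it in your distribution, e.g. through Spark's hive-site.xml or spark-defaults.conf) and the table is an external Hive table whose HDFS location your user can read (the database and table names below are made up):

```scala
// Read an external Hive table through the Spark API; this only works if the HDFS
// directory backing the table is readable by the user running the Spark job.
val events = spark.table("mydb.external_events")   // hypothetical database.table
events.show(5)
```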
You can find information on access patterns according to Hive table type, read/write features and security requirements in the following links: