Spark and Hive in Hadoop 3: Difference between metastore.catalog.default and spark.sql.catalogImplementation

I'm working on a Hadoop cluster (HDP) with Hadoop 3. Spark and Hive are also installed.

Since the Spark and Hive catalogs are separate, it is sometimes a bit confusing to know how and where to save data in a Spark application.

I know that the property spark.sql.catalogImplementation can be set either to in-memory (to use a Spark session-scoped catalog) or to hive (to use the Hive catalog for persistent metadata storage -> but the metadata is still kept separate from the Hive databases and tables).
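For context, this is roughly how the property gets set when building the session (the app name is just a placeholder):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("catalog-demo")  # placeholder app name
    # "hive" or "in-memory"; this is a static setting, so it has to be chosen
    # before the first SparkSession is created (or via spark-defaults.conf).
    .config("spark.sql.catalogImplementation", "hive")
    .enableHiveSupport()  # shorthand that also selects the "hive" catalog
    .getOrCreate()
)

print(spark.conf.get("spark.sql.catalogImplementation"))
```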

I'm wondering what the property metastore.catalog.default does. When I set this to hive I can see my Hive tables, but since the tables are stored in the /warehouse/tablespace/managed/hive directory in HDFS, my user has no access to this directory (because hive is of course the owner).

So, why should I set metastore.catalog.default = hive if I can't access the tables from Spark? Does it have something to do with Hortonworks' Hive Warehouse Connector?

Thank you for your help.

D. Müller asked Jan 24 '20

People also ask

What is the difference between Spark SQL and Hive?

Hive and Spark are both immensely popular tools in the big data world. Hive is well suited to SQL-based analytics on large volumes of data, while Spark is a general-purpose engine for big data processing that provides a faster, more modern alternative to MapReduce.

What is the Hive catalog and the Spark catalog?

A Hive metastore warehouse (aka spark-warehouse) is the directory where Spark SQL persists tables whereas a Hive metastore (aka metastore_db) is a relational database to manage the metadata of the persistent relational entities, e.g. databases, tables, columns, partitions.

How do Spark and Hive use two different catalogs?

There are two catalog implementations: in-memory, which creates in-memory tables that are only available within the Spark session, and hive, which creates persistent tables using an external Hive metastore.

Does Spark need a Hive metastore?

Spark SQL does not use a Hive metastore under the covers (and defaults to in-memory, non-Hive catalogs, unless you're in spark-shell, which does the opposite). The default external catalog implementation is controlled by the spark.sql.catalogImplementation internal property and can take one of two possible values: hive and in-memory.

Does Apache Spark use Hadoop?

Spark bootstraps a pseudo-metastore (an embedded Derby DB) for internal use, and optionally uses an actual Hive metastore to read/write persistent Hadoop data. This does not mean that Spark uses Hive's I/O libraries, just the Hive metadata. – Samson Scharfrichter May 9 '17 at 19:19

What is the difference between Apache Spark and Hive?

Language support: Apache Spark supports multiple programming languages. Speed: operations in Hive are slower than in Apache Spark for both in-memory and on-disk processing, since Hive runs on top of Hadoop. Read/write operations: the number of read/write operations in Hive is higher than in Apache Spark.

1 Answer

Catalog implementations

There are two catalog implementations:

  • in-memory, which creates in-memory tables that are only available within the Spark session;
  • hive, which creates persistent tables using an external Hive metastore.

More details here.
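To illustrate the practical difference, here is a hedged sketch (the table name and example data are hypothetical): with the in-memory catalog, metadata created by saveAsTable only lives for the current SparkSession, while with the hive catalog it is recorded in the external Hive metastore and remains visible to later sessions.

```python
# Hypothetical example data and table name.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# With catalogImplementation=in-memory, the table's metadata disappears when the
# session ends; with catalogImplementation=hive, it is persisted in the metastore.
df.write.mode("overwrite").saveAsTable("demo_table")

# Listing goes through whichever catalog implementation the session was built with.
for t in spark.catalog.listTables():
    print(t.name, t.tableType)
```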

Metastore catalog

Multiple catalogs can coexist in the same Hive metastore. For example, HDP versions from 3.1.0 to 3.1.4 use a different catalog for Spark tables than for Hive tables.
You may want to set metastore.catalog.default=hive to read Hive external tables through the Spark API. The table's location in HDFS must be accessible to the user running the Spark application.
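A hedged sketch of that setup, assuming the spark.hadoop. prefix is used to forward the property to the Hive client that Spark embeds (check your HDP version's documentation for the exact mechanism); the database and table names are hypothetical:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("read-hive-external")  # placeholder app name
    .config("spark.sql.catalogImplementation", "hive")
    # Assumption: the spark.hadoop. prefix forwards the property to the
    # Hive/Hadoop configuration used by Spark's metastore client.
    .config("spark.hadoop.metastore.catalog.default", "hive")
    .enableHiveSupport()
    .getOrCreate()
)

# Hypothetical external Hive table; the HDFS directory behind it must be
# readable by the user running this application.
spark.read.table("sales_db.raw_events").show(5)
```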

HDP 3.1.4 documentation

You can find information on access patterns by Hive table type, read/write features, and security requirements in the following links (a brief Hive Warehouse Connector sketch follows the list):

  • Apache Hive 3 tables
  • Hive Warehouse Connector for accessing Apache Spark data
  • Using the Hive Warehouse Connector with Spark
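For completeness, a minimal Hive Warehouse Connector sketch (HDP 3.x style), assuming the HWC assembly jar and the pyspark_llap zip were supplied at submit time and that spark.sql.hive.hiveserver2.jdbc.url and the related HWC settings are configured; the table name is hypothetical. Because reads go through HiveServer2/LLAP, Hive's own authorization applies instead of requiring direct HDFS access to /warehouse/tablespace/managed/hive.

```python
from pyspark_llap import HiveWarehouseSession

# Build an HWC session on top of an existing SparkSession named `spark`.
hive = HiveWarehouseSession.session(spark).build()

# Hypothetical managed (ACID) table; the query is executed by HiveServer2/LLAP,
# so access is governed by Hive authorization rather than HDFS file permissions.
orders = hive.executeQuery("SELECT * FROM sales_db.managed_orders LIMIT 10")
orders.show()
```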
Arthur PICHOT UTRERA answered Oct 12 '22