I'm working on a Hadoop cluster (HDP) with Hadoop 3. Spark and Hive are also installed.
Since the Spark and Hive catalogs are separate, it is sometimes confusing to know how and where to save data in a Spark application.
I know that the property spark.sql.catalogImplementation can be set to either in-memory (to use a Spark-session-based catalog) or hive (to use the Hive catalog for persistent metadata storage; the metadata is still kept separate from the Hive DBs and tables, though).
I'm wondering what the property metastore.catalog.default does. When I set this to hive I can see my Hive tables, but since the tables are stored in the /warehouse/tablespace/managed/hive directory in HDFS, my user has no access to this directory (because hive is of course the owner).
So why should I set metastore.catalog.default = hive if I can't access the tables from Spark? Does it have something to do with Hortonworks' Hive Warehouse Connector?
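For reference, this is roughly how I create the session (just a sketch: the app name is made up, in the real job most settings come from spark-defaults.conf, and I'm assuming the spark.hadoop. prefix forwards metastore.catalog.default to the metastore client):

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: app name is made up; in the real job these settings live in spark-defaults.conf.
val spark = SparkSession.builder()
  .appName("catalog-test")
  // equivalent to setting spark.sql.catalogImplementation=hive
  .enableHiveSupport()
  // assumption: the spark.hadoop. prefix forwards this property to the Hive metastore client
  .config("spark.hadoop.metastore.catalog.default", "hive")
  .getOrCreate()

// With metastore.catalog.default=hive this should list the Hive-owned tables as well.
spark.sql("SHOW TABLES IN default").show()
```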
Thank you for your help.
Hive and Spark are both immensely popular tools in the big data world. Hive is well suited to SQL-based analytics on large volumes of data, while Spark is a faster, more modern alternative to MapReduce for general-purpose big data processing.
A Hive metastore warehouse (aka spark-warehouse) is the directory where Spark SQL persists tables whereas a Hive metastore (aka metastore_db) is a relational database to manage the metadata of the persistent relational entities, e.g. databases, tables, columns, partitions.
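As a quick illustration, in spark-shell (where spark is the predefined SparkSession) you can print the warehouse directory the session is using; this is only a sketch, and the actual path depends on your configuration:

```scala
// The warehouse is a directory of data files; the metastore is a database of metadata.
println(spark.conf.get("spark.sql.warehouse.dir"))   // e.g. .../spark-warehouse, or the Hive warehouse directory
```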
Spark SQL does not use a Hive metastore under the covers (and defaults to in-memory, non-Hive catalogs unless you're in spark-shell, which does the opposite). The default external catalog implementation is controlled by the internal property spark.sql.catalogImplementation, which can take one of two values: hive or in-memory.
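For example, in spark-shell you can check which catalog implementation the current session ended up with (a sketch; the second argument is only a fallback in case the property was never set explicitly):

```scala
// Returns "hive" or "in-memory" for the active session.
println(spark.conf.get("spark.sql.catalogImplementation", "in-memory"))
```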
Spark bootstraps a pseudo-Metastore (embedded Derby DB) for internal use, and optionally uses an actual Hive Metastore to read/write persistent Hadoop data. Which does not mean that Spark uses Hive I/O libs, just the Hive meta-data. – Samson Scharfrichter May 9 '17 at 19:19
Apache Spark supports multiple languages. Speed: operations in Hive are slower than in Apache Spark in terms of memory and disk processing, as Hive runs on top of Hadoop. Read/write operations: the number of read/write operations in Hive is greater than in Apache Spark.
There are two catalog implementations: in-memory, to create in-memory tables that are only available in the Spark session, and hive, to create persistent tables using an external Hive Metastore. More details here.
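For illustration, a minimal sketch of the difference (the table name is made up, and spark is your SparkSession): with in-memory the saved table's metadata lives only for the current session, while with hive it is written to the external Hive Metastore and survives the session.

```scala
// Create a managed table through whichever catalog implementation is active.
spark.range(10).write.saveAsTable("demo_numbers")   // hypothetical table name

// With in-memory this listing is session-scoped and disappears when the session ends;
// with hive it reflects (and persists to) the external Hive Metastore.
spark.catalog.listTables("default").show()
```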
Multiple catalogs can coexist in the same Hive Metastore. For example, HDP versions from 3.1.0 to 3.1.4 use a different catalog to save Spark tables and Hive tables.
You may want to set metastore.catalog.default=hive to read Hive external tables using the Spark API. The table location in HDFS must be accessible to the user running the Spark app.
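For example, assuming metastore.catalog.default=hive is in effect (however you pass it in your distribution, e.g. through Spark's hive-site.xml or spark-defaults.conf) and the table is an external Hive table whose HDFS location your user can read (the database and table names below are made up):

```scala
// Read an external Hive table through the Spark API; this only works if the HDFS
// directory backing the table is readable by the user running the Spark job.
val events = spark.table("mydb.external_events")   // hypothetical database.table
events.show(5)
```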
You can find information on access patterns according to Hive table type, read/write features and security requirements in the following links: