 

Spark 2: how does it work when SparkSession enableHiveSupport() is invoked

My question is rather simple, but somehow I cannot find a clear answer by reading the documentation.

I have Spark 2 running on a CDH 5.10 cluster. There is also Hive and a metastore.

I create a session in my Spark program as follows:

SparkSession spark = SparkSession.builder().appName("MyApp").enableHiveSupport().getOrCreate();

Suppose I have the following HiveQL query:

spark.sql("SELECT someColumn FROM someTable")

I would like to know whether:

  1. under the hood this query is translated into Hive MapReduce primitives, or
  2. the support for HiveQL is only at a syntactical level and Spark SQL will be used under the hood.

I am doing some performance evaluation, and I don't know whether the timings of queries executed with spark.sql([hiveQL query]) should be attributed to Spark or to Hive.

asked Sep 04 '18 by Anthony Arrascue

People also ask

What is Spark 2.0 SparkSession?

SparkSession was introduced in Spark 2.0. It is the entry point to the underlying Spark functionality and is used to programmatically create Spark RDDs, DataFrames, and Datasets.
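
For illustration, a minimal Java sketch of using the session as that entry point (the app name, local master, and column name are placeholders, not taken from the question):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class EntryPointExample {
    public static void main(String[] args) {
        // The session is the single entry point for SQL, DataFrames,
        // Datasets and (via sparkContext()) RDDs.
        SparkSession spark = SparkSession.builder()
                .appName("EntryPointExample")  // placeholder name
                .master("local[*]")            // placeholder: run locally
                .getOrCreate();

        // A trivial Dataset created through the session.
        Dataset<Row> df = spark.range(5).toDF("id");
        df.show();

        spark.stop();
    }
}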

Do I need to close SparkSession?

You should always close your SparkSession when you are done with it, even if only to follow the good practice of giving back what you have been given. Closing a SparkSession may trigger the freeing of cluster resources that could be given to some other application.
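
A small sketch of that practice, assuming a session built as in the question; stopping in a finally block ensures the resources are released even if the job fails:

SparkSession spark = SparkSession.builder()
        .appName("MyApp")
        .enableHiveSupport()
        .getOrCreate();
try {
    spark.sql("SELECT someColumn FROM someTable").show();
} finally {
    // stop() shuts the session down and releases the driver's cluster resources.
    spark.stop();
}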

What is enableHiveSupport?

enableHiveSupport() enables Hive support, including connectivity to a persistent Hive metastore, support for Hive SerDes, and Hive user-defined functions.
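
As a sketch against the question's setup (assuming the cluster's Hive metastore is reachable and spark-hive is on the classpath; imports as in the example above), enabling Hive support makes metastore objects visible through the same session:

SparkSession spark = SparkSession.builder()
        .appName("MyApp")
        .enableHiveSupport()
        .getOrCreate();

// Databases now come from the persistent Hive metastore.
spark.sql("SHOW DATABASES").show();

// The question's someTable is resolved against the metastore as well.
Dataset<Row> t = spark.table("someTable");
t.printSchema();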

What is SparkSession builder getOrCreate ()?

getOrCreate() gets an existing SparkSession or, if there is no existing one, creates a new one based on the options set in this builder (new in version 2.0.0). The method first checks whether there is a valid global default SparkSession and, if so, returns that one.
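
A tiny sketch of that check-then-create behaviour within a single JVM (the app names are placeholders):

SparkSession first = SparkSession.builder().appName("first").getOrCreate();
SparkSession second = SparkSession.builder().appName("second").getOrCreate();

// The second builder found the valid global default and returned it,
// so both variables point at the same session.
System.out.println(first == second);  // prints true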


1 Answer

Spark knows two catalog implementations: hive and in-memory. If you set enableHiveSupport(), then spark.sql.catalogImplementation is set to hive, otherwise to in-memory. So if you enable Hive support, spark.catalog.listTables().show() will show you all the tables from the Hive metastore.
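
One way to see this from the question's session (a sketch in Java; the first line simply reads back the config key mentioned above):

// "hive" when enableHiveSupport() was called, "in-memory" otherwise.
System.out.println(spark.conf().get("spark.sql.catalogImplementation"));

// With the hive catalog, this lists the tables registered in the metastore.
spark.catalog().listTables().show();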

But this does not mean Hive is used for the query*; it just means that Spark communicates with the Hive metastore. The execution engine is always Spark.

*There are actually some functions, like percentile and percentile_approx, which are native Hive UDAFs.
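
A quick way to confirm this for the question's query is to look at the plan Spark builds: it consists of Spark operators (scans, exchanges, aggregates), not MapReduce stages. A sketch using the session and table from the question:

// Prints the parsed, analyzed, optimized and physical plans; the physical
// plan is made of Spark operators, not Hive MapReduce jobs.
spark.sql("SELECT someColumn FROM someTable").explain(true);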

answered Sep 24 '22 by Raphael Roth