 

Spark 2: how does it work when SparkSession enableHiveSupport() is invoked

My question is rather simple, but somehow I cannot find a clear answer by reading the documentation.

I have Spark 2 running on a CDH 5.10 cluster. There is also Hive and a metastore.

I create a session in my Spark program as follows:

SparkSession spark = SparkSession.builder().appName("MyApp").enableHiveSupport().getOrCreate();

Suppose I have the following HiveQL query:

spark.sql("SELECT someColumn FROM someTable")

I would like to know whether:

  1. under the hood this query is translated into Hive MapReduce primitives, or
  2. the support for HiveQL is only at a syntactical level and Spark SQL will be used under the hood.

I am doing some performance evaluation, and I don't know whether the timings of queries executed with spark.sql([hiveQL query]) should be attributed to Spark or to Hive.

asked Sep 04 '18 by Anthony Arrascue

People also ask

What is Spark 2.0 SparkSession?

SparkSession was introduced in Spark 2.0. It is the entry point to the underlying Spark functionality and is used to programmatically create Spark RDDs, DataFrames, and Datasets.
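
For illustration, a minimal Java sketch of using the session as that entry point (the app name, local master, and column name are placeholders, not taken from the question):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class EntryPointExample {
    public static void main(String[] args) {
        // The session is the single entry point for SQL, DataFrames,
        // Datasets and (via sparkContext()) RDDs.
        SparkSession spark = SparkSession.builder()
                .appName("EntryPointExample")  // placeholder name
                .master("local[*]")            // placeholder: run locally
                .getOrCreate();

        // A trivial Dataset created through the session.
        Dataset<Row> df = spark.range(5).toDF("id");
        df.show();

        spark.stop();
    }
}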

Do I need to close SparkSession?

You should always close your SparkSession when you are done with it, even if only to follow the good practice of giving back what you have been given. Closing a SparkSession may trigger the freeing of cluster resources that could be given to some other application.
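
A small sketch of that practice, assuming a session built as in the question; stopping in a finally block ensures the resources are released even if the job fails:

SparkSession spark = SparkSession.builder()
        .appName("MyApp")
        .enableHiveSupport()
        .getOrCreate();
try {
    spark.sql("SELECT someColumn FROM someTable").show();
} finally {
    // stop() shuts the session down and releases the driver's cluster resources.
    spark.stop();
}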

What is enableHiveSupport?

enableHiveSupport() enables Hive support, including connectivity to a persistent Hive metastore, support for Hive SerDes, and Hive user-defined functions.
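
As a sketch against the question's setup (assuming the cluster's Hive metastore is reachable and spark-hive is on the classpath; imports as in the example above), enabling Hive support makes metastore objects visible through the same session:

SparkSession spark = SparkSession.builder()
        .appName("MyApp")
        .enableHiveSupport()
        .getOrCreate();

// Databases now come from the persistent Hive metastore.
spark.sql("SHOW DATABASES").show();

// The question's someTable is resolved against the metastore as well.
Dataset<Row> t = spark.table("someTable");
t.printSchema();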

What is SparkSession builder getOrCreate ()?

getOrCreate() gets an existing SparkSession or, if there is no existing one, creates a new one based on the options set in this builder (new in version 2.0.0). The method first checks whether there is a valid global default SparkSession and, if so, returns that one.
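
A tiny sketch of that check-then-create behaviour within a single JVM (the app names are placeholders):

SparkSession first = SparkSession.builder().appName("first").getOrCreate();
SparkSession second = SparkSession.builder().appName("second").getOrCreate();

// The second builder found the valid global default and returned it,
// so both variables point at the same session.
System.out.println(first == second);  // prints true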


1 Answer

Spark knows two catalog implementations: hive and in-memory. If you set enableHiveSupport(), then spark.sql.catalogImplementation is set to hive, otherwise to in-memory. So if you enable Hive support, spark.catalog.listTables().show() will show you all the tables from the Hive metastore.
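
One way to see this from the question's session (a sketch in Java; the first line simply reads back the config key mentioned above):

// "hive" when enableHiveSupport() was called, "in-memory" otherwise.
System.out.println(spark.conf().get("spark.sql.catalogImplementation"));

// With the hive catalog, this lists the tables registered in the metastore.
spark.catalog().listTables().show();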

But this does not mean Hive is used for the query*; it just means that Spark communicates with the Hive metastore. The execution engine is always Spark.

*There are actually some functions, like percentile and percentile_approx, which are native Hive UDAFs.
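
A quick way to confirm this for the question's query is to look at the plan Spark builds: it consists of Spark operators (scans, exchanges, aggregates), not MapReduce stages. A sketch using the session and table from the question:

// Prints the parsed, analyzed, optimized and physical plans; the physical
// plan is made of Spark operators, not Hive MapReduce jobs.
spark.sql("SELECT someColumn FROM someTable").explain(true);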

answered Sep 24 '22 by Raphael Roth