I have a simple piece of Scala code that retrieves data from the Hive database and creates an RDD out of the result set. It works fine with HiveContext. The code is similar to this:
val hc = new HiveContext(sc)
val mySql = "select PRODUCT_CODE, DATA_UNIT from account"
hc.sql("use myDatabase")
val rdd = hc.sql(mySql).rdd
The version of Spark I'm using is 1.3. The problem is that the default setting for hive.execution.engine is 'mr', which makes Hive use MapReduce, and that is slow. Unfortunately I can't force it to use "spark". I tried to use SQLContext instead, replacing hc = new HiveContext(sc) with hc = new SQLContext(sc), to see if performance would improve. With this change the line
hc.sql("use myDatabase")
is throwing the following exception:
Exception in thread "main" java.lang.RuntimeException: [1.1] failure: ``insert'' expected but identifier use found
use myDatabase
^
The Spark 1.3 documentation says that Spark SQL can work with Hive tables. My question is how to indicate that I want to use a certain database instead of the default one.
The USE <database> statement is supported in later Spark versions:
https://docs.databricks.com/spark/latest/spark-sql/language-manual/use-database.html
You need to put the statements in two separate spark.sql calls, like this:
spark.sql("use mydb")
spark.sql("select * from mytab_in_mydb").show
Go back to creating the HiveContext. The HiveContext gives you the ability to create a DataFrame using Hive's metastore. Spark only uses the metastore from Hive; it doesn't use Hive as a processing engine to retrieve the data. So when you create a DataFrame from your SQL query, it is really just asking Hive's metastore "Where is the data, and what is the format of the data?"
Spark takes that information and runs the query against the underlying data on HDFS. So Spark is executing the query, not Hive.
When you create the SQLContext, you remove the link between Spark and the Hive metastore, so the error is saying it doesn't understand what you want to do.
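A minimal sketch of that flow for Spark 1.3, reusing the database and table names from the question (it assumes a configured Hive metastore; the application name is just illustrative):
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// HiveContext only reads table locations and formats from the Hive metastore;
// Spark itself plans and executes the query against the underlying data on HDFS.
val sc = new SparkContext(new SparkConf().setAppName("HiveMetastoreExample"))
val hc = new HiveContext(sc)

hc.sql("use myDatabase") // select the database in the metastore
val df = hc.sql("select PRODUCT_CODE, DATA_UNIT from account") // executed by Spark, not Hive MapReduce
val rdd = df.rdd // same RDD as in the original code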