
Querying on multiple Hive stores using Apache Spark

I have a Spark application that successfully connects to Hive and queries Hive tables using the Spark engine.

To build this, I simply added hive-site.xml to the application's classpath, and Spark reads hive-site.xml to connect to its metastore. This approach was suggested on Spark's mailing list.
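For reference, the metastore connection that Spark picks up from hive-site.xml is typically the hive.metastore.uris property; a minimal sketch (the host is a placeholder, 9083 is the usual metastore Thrift port):

<configuration>
  <property>
    <!-- Thrift URI of the remote Hive metastore for this application -->
    <name>hive.metastore.uris</name>
    <value>thrift://<metastore-host>:9083</value>
  </property>
</configuration>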

So far so good. Now I want to connect to two Hive stores, and I don't think adding another hive-site.xml to my classpath will help. I have gone through quite a few articles and Spark mailing list threads but could not find anyone doing this.

Can someone suggest how I can achieve this?

Thanks.

Docs referred:

  • Hive on Spark

  • Spark docs

  • HiveContext

Asked Sep 22 '15 by karthik manchala

People also ask

Can we run Hive queries in Spark?

Spark SQL supports queries written in HiveQL, a SQL-like language whose queries are converted into Spark jobs.
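For example, a minimal Spark 1.x-style sketch of running HiveQL through HiveContext (the table name my_table is a placeholder):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object HiveQLExample {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("HiveQLExample").setMaster("local")
    val sc = new SparkContext(conf)
    val hiveContext = new HiveContext(sc)
    // The HiveQL statement is parsed by Spark SQL and executed as Spark jobs
    hiveContext.sql("SELECT COUNT(*) FROM my_table").show()
  }
}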

Does Spark use Hive Metastore?

Spark SQL uses a Hive metastore to manage the metadata of persistent relational entities (e.g. databases, tables, columns, partitions) in a relational database (for fast access).

What is the difference between running queries in Hive and in Spark?

Hive provides schema flexibility, partitioning, and bucketing of tables, whereas with Spark SQL it is only possible to read data from an existing Hive installation. Hive provides access rights for users, roles, and groups, whereas Spark SQL provides no facility for granting access rights to a user.


2 Answers

I think this is possible by making use of Spark SQL's capability to connect to and read data from remote databases over JDBC.

After some exhaustive R&D, I was able to connect to two different Hive environments over JDBC and load the Hive tables into Spark as DataFrames for further processing.

Environment details

hadoop-2.6.0

apache-hive-2.0.0-bin

spark-1.3.1-bin-hadoop2.6

Code sample: HiveMultiEnvironment.scala

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

object HiveMultiEnvironment {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("JDBC").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Load a Hive table (or sub-query) from Environment 1 over JDBC
    val jdbcDF1 = sqlContext.load("jdbc", Map(
      "url" -> "jdbc:hive2://<host1>:10000/<db>",
      "dbtable" -> "<db.tablename or subquery>",
      "driver" -> "org.apache.hive.jdbc.HiveDriver",
      "user" -> "<username>",
      "password" -> "<password>"))
    jdbcDF1.foreach { println }

    // Load a Hive table (or sub-query) from Environment 2 over JDBC
    val jdbcDF2 = sqlContext.load("jdbc", Map(
      "url" -> "jdbc:hive2://<host2>:10000/<db>",
      "dbtable" -> "<db.tablename> or <subquery>",
      "driver" -> "org.apache.hive.jdbc.HiveDriver",
      "user" -> "<username>",
      "password" -> "<password>"))
    jdbcDF2.foreach { println }
  }
  // todo: business logic
}

Other parameters can also be set during the load using SQLContext, such as partitionColumn. Details can be found under the 'JDBC To Other Databases' section of the Spark reference docs: https://spark.apache.org/docs/1.3.0/sql-programming-guide.html
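As a sketch, a partitioned load over JDBC could look like the following, reusing the sqlContext from the example above (the column name, bounds, and partition count are placeholders; partitionColumn, lowerBound, upperBound, and numPartitions are specified together):

// Parallel read: Spark splits the query into numPartitions ranges on partitionColumn
val partitionedDF = sqlContext.load("jdbc", Map(
  "url" -> "jdbc:hive2://<host1>:10000/<db>",
  "dbtable" -> "<db.tablename>",
  "driver" -> "org.apache.hive.jdbc.HiveDriver",
  "user" -> "<username>",
  "password" -> "<password>",
  "partitionColumn" -> "<numeric column>",
  "lowerBound" -> "1",
  "upperBound" -> "100000",
  "numPartitions" -> "10"))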

Build path from Eclipse (screenshot omitted).

What I Haven't Tried

Using HiveContext for Environment 1 and SQLContext for Environment 2 (a rough sketch of this idea is below).
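Purely as an untried sketch of that idea (it assumes hive-site.xml for Environment 1 is on the classpath, and reaches Environment 2 over JDBC as in the main example):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext

// Untried: Environment 1 via the classpath hive-site.xml, Environment 2 via JDBC
val conf = new SparkConf().setAppName("MixedContexts").setMaster("local")
val sc = new SparkContext(conf)

// HiveContext picks up hive-site.xml from the classpath (Environment 1)
val hiveContext = new HiveContext(sc)
val env1DF = hiveContext.sql("SELECT * FROM <db.tablename>")

// A plain SQLContext reads Environment 2 over JDBC
val sqlContext = new SQLContext(sc)
val env2DF = sqlContext.load("jdbc", Map(
  "url" -> "jdbc:hive2://<host2>:10000/<db>",
  "dbtable" -> "<db.tablename>",
  "driver" -> "org.apache.hive.jdbc.HiveDriver",
  "user" -> "<username>",
  "password" -> "<password>"))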

Hope this will be useful.

Answered Sep 20 '22 by Aditya


This doesn't seem to be possible in the current version of Spark. Reading the HiveContext code in the Spark repo, it appears that hive.metastore.uris can be configured with multiple metastore URIs, but they are used only for redundancy against the same metastore, not for totally different metastores.

More information is here: https://cwiki.apache.org/confluence/display/Hive/AdminManual+MetastoreAdmin

You will probably have to aggregate the data somewhere in order to work on it in unison, or you could create a separate SparkContext for each store.

You could try configuring hive.metastore.uris with multiple different metastores, but it probably won't work. If you do decide to create a separate SparkContext for each store, then make sure you set spark.driver.allowMultipleContexts, but this is generally discouraged and may lead to unexpected results.
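As a rough sketch only (this is the discouraged path, and it does not demonstrate pointing each context at a different metastore):

import org.apache.spark.{SparkConf, SparkContext}

// Discouraged: allow more than one SparkContext in the same JVM
val conf1 = new SparkConf()
  .setAppName("Store1")
  .setMaster("local")
  .set("spark.driver.allowMultipleContexts", "true")
val sc1 = new SparkContext(conf1)

val conf2 = new SparkConf()
  .setAppName("Store2")
  .setMaster("local")
  .set("spark.driver.allowMultipleContexts", "true")
val sc2 = new SparkContext(conf2)

// Each context can get its own HiveContext/SQLContext, but both would still
// read the same hive-site.xml from the classpath, which is part of why this
// approach is discouraged and may behave unexpectedly.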

Answered Sep 22 '22 by Stephen Carman