How to connect to remote hive server from spark [duplicate]

I'm running Spark locally and want to access Hive tables that are located in a remote Hadoop cluster.

I'm able to access the Hive tables by launching beeline under SPARK_HOME:

[ml@master spark-2.0.0]$./bin/beeline 
Beeline version 1.2.1.spark2 by Apache Hive
beeline> !connect jdbc:hive2://remote_hive:10000
Connecting to jdbc:hive2://remote_hive:10000
Enter username for jdbc:hive2://remote_hive:10000: root
Enter password for jdbc:hive2://remote_hive:10000: ******
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/ml/spark/spark-2.0.0/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
16/10/12 19:06:39 INFO jdbc.Utils: Supplied authorities: remote_hive:10000
16/10/12 19:06:39 INFO jdbc.Utils: Resolved authority: remote_hive:10000
16/10/12 19:06:39 INFO jdbc.HiveConnection: Will try to open client transport with JDBC Uri: jdbc:hive2://remote_hive:10000
Connected to: Apache Hive (version 1.2.1000.2.4.2.0-258)
Driver: Hive JDBC (version 1.2.1.spark2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://remote_hive:10000>

How can I access the remote Hive tables programmatically from Spark?

asked Oct 12 '16 11:10

April

2 Answers

JDBC is not required

Spark connects directly to the Hive metastore, not through HiveServer2. To configure this:

1. Put hive-site.xml on your classpath and set hive.metastore.uris to the address where your Hive metastore is hosted. Also see "How to connect to a Hive metastore programmatically in SparkSQL?"
2. Import org.apache.spark.sql.hive.HiveContext, which can run SQL queries over Hive tables.
3. Define val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
4. Run sqlContext.sql("show tables").show() to verify that it works.
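For step 1, a minimal hive-site.xml might look like the sketch below. The host name remote_hive is only borrowed from the question's HiveServer2 URL and is an assumption; the metastore may run on a different host. 9083 is the default metastore Thrift port, so check the actual value on your cluster.

```xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <!-- Assumed host and default port; replace with your metastore's address -->
    <value>thrift://remote_hive:9083</value>
  </property>
</configuration>
```

Placing this file in $SPARK_HOME/conf is the usual way to get it onto Spark's classpath.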

SparkSQL on Hive tables

If you must go the JDBC way

Have a look at "Connecting Apache Spark with Apache Hive remotely".

Please note that beeline also connects through JDBC. This is evident from your own log:

[ml@master spark-2.0.0]$./bin/beeline
Beeline version 1.2.1.spark2 by Apache Hive
beeline> !connect jdbc:hive2://remote_hive:10000
Connecting to jdbc:hive2://remote_hive:10000

So please have a look at this interesting article, which lists three methods:

Method 1: Pull table into Spark using JDBC
Method 2: Use Spark JdbcRDD with HiveServer2 JDBC driver
Method 3: Fetch dataset on a client side, then create RDD manually

Currently the HiveServer2 driver doesn't allow us to use the "Sparkling" Methods 1 and 2, so we can rely only on Method 3.

Below is an example code snippet through which this can be achieved.

It loads data from one Hadoop cluster (aka "remote") into another one (where my Spark lives, aka "domestic") through a HiveServer2 JDBC connection.

import java.sql.{Connection, DriverManager, ResultSet, Timestamp}
import scala.collection.mutable.MutableList

case class StatsRec (
  first_name: String,
  last_name: String,
  action_dtm: Timestamp,
  size: Long,
  size_p: Long,
  size_d: Long
)

// Connection settings. The URL and user are the ones from the question;
// the password is a placeholder. Replace all three with your own values.
val url = "jdbc:hive2://remote_hive:10000"
val user = "root"
val password = "******"

// Fetch the whole result set on the driver through HiveServer2 JDBC.
val conn: Connection = DriverManager.getConnection(url, user, password)
val res: ResultSet = conn.createStatement
                   .executeQuery("SELECT * FROM stats_201512301914")
val fetchedRes = MutableList[StatsRec]()
while (res.next()) {
  val rec = StatsRec(res.getString("first_name"),
     res.getString("last_name"),
     Timestamp.valueOf(res.getString("action_dtm")),
     res.getLong("size"),
     res.getLong("size_p"),
     res.getLong("size_d"))
  fetchedRes += rec
}
conn.close()

// Distribute the fetched rows as an RDD and cache it.
val rddStatsDelta = sc.parallelize(fetchedRes)
rddStatsDelta.cache()

// Basically we are done. To check loaded data:
println(rddStatsDelta.count)
rddStatsDelta.take(10).foreach(println)

answered Nov 16 '22 09:11

Ram Ghadiyaram

After providing the hive-site.xml configuration to Spark and starting the Hive Metastore service, two things need to be configured in the Spark session when connecting to Hive:

1. The metastore URI: since Spark SQL connects to the Hive metastore using Thrift, we need to provide the Thrift server URI while creating the Spark session.
2. The Hive Metastore warehouse, which is the directory where Spark SQL persists tables. Use the property 'spark.sql.warehouse.dir', which corresponds to 'hive.metastore.warehouse.dir' (the latter is deprecated as of Spark 2.0).

Something like:

    SparkSession spark = SparkSession.builder()
        .appName("Spark_SQL_5_Save To Hive")
        .config("spark.sql.warehouse.dir", "/user/hive/warehouse")  // set before getOrCreate() so it takes effect
        .config("hive.metastore.uris", "thrift://localhost:9083")
        .enableHiveSupport()
        .getOrCreate();

Hope this was helpful !!

answered Nov 16 '22 10:11

Amardeep Kohli

Tags:

apache-spark

apache-spark-sql

hive

spark-thriftserver