I have a scenario where I need to compare two tables, a source and a destination, on two separate remote Hive servers. Can we use two SparkSessions, something like what I tried below?
val spark = SparkSession.builder().master("local")
.appName("spark remote")
.config("javax.jdo.option.ConnectionURL", "jdbc:mysql://192.168.175.160:3306/metastore?useSSL=false")
.config("javax.jdo.option.ConnectionUserName", "hiveroot")
.config("javax.jdo.option.ConnectionPassword", "hivepassword")
.config("hive.exec.scratchdir", "/tmp/hive/${user.name}")
.config("hive.metastore.uris", "thrift://192.168.175.160:9083")
.enableHiveSupport()
.getOrCreate()
SparkSession.clearActiveSession()
SparkSession.clearDefaultSession()
val sparkdestination = SparkSession.builder()
.config("javax.jdo.option.ConnectionURL", "jdbc:mysql://192.168.175.42:3306/metastore?useSSL=false")
.config("javax.jdo.option.ConnectionUserName", "hiveroot")
.config("javax.jdo.option.ConnectionPassword", "hivepassword")
.config("hive.exec.scratchdir", "/tmp/hive/${user.name}")
.config("hive.metastore.uris", "thrift://192.168.175.42:9083")
.enableHiveSupport()
.getOrCreate()
I tried with SparkSession.clearActiveSession() and SparkSession.clearDefaultSession(), but it doesn't work and throws the error below:
Hive: Failed to access metastore. This class should not accessed in runtime.
org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
Is there any other way to access two Hive tables using multiple SparkSessions or SparkContexts?
Thanks
Spark applications can use multiple sessions to work with different underlying data catalogs. You can create a new session from an existing one by calling its newSession method.
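For example, a minimal sketch reusing the spark session from the question (the table name is just a placeholder):

val spark2 = spark.newSession()  // shares the same SparkContext as spark, but has its own SQL conf, temp views and UDF registry
val df = spark2.sql("select * from <SourceTableOfMetastore_1>")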
Note: we can have multiple Spark contexts by setting spark.driver.allowMultipleContexts to true. But having multiple Spark contexts in the same JVM is not encouraged and is not considered good practice, as it makes the application less stable and a crash in one Spark context can affect the other.
Normally you can only have one active SparkContext at a time.
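For illustration only, a hedged sketch of what that setting looks like (given the caveats above this is not recommended; the app name is made up):

import org.apache.spark.{SparkConf, SparkContext}

// Discouraged: allowing a second SparkContext in the same JVM
val conf2 = new SparkConf()
  .setMaster("local[*]")
  .setAppName("second-context")
  .set("spark.driver.allowMultipleContexts", "true")
val sc2 = new SparkContext(conf2)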
I use this approach and it works perfectly fine with Spark 2.1:
val sc = SparkSession.builder()
.config("hive.metastore.uris", "thrift://dbsyz1111:10000")
.enableHiveSupport()
.getOrCreate()
// Create dataframe 1 by reading data from the Hive table in metastore 1
val dataframe_1 = sc.sql("select * from <SourceTableOfMetastore_1>")
// Clear the existing Spark sessions
SparkSession.clearActiveSession()
SparkSession.clearDefaultSession()
// Initialize Spark session 2 with Hive metastore 2
val spc2 = SparkSession.builder()
.config("hive.metastore.uris", "thrift://dbsyz2222:10004")
.enableHiveSupport()
.getOrCreate()
// Load dataframe 1 from session 1 into a new dataframe in session 2, carrying its schema and data over via the RDD API
val dataframe_2 = spc2.createDataFrame(dataframe_1.rdd, dataframe_1.schema)
dataframe_2.write.mode("Append").saveAsTable("<TargetTableNameOfMetastore_2>")