 

SparkSession/SparkContext cannot get Hadoop configuration

I am running Spark 2, Hive, and Hadoop on my local machine, and I want to use Spark SQL to read data from a Hive table.

Everything works fine when Hadoop is running on the default hdfs://localhost:9000, but if I change it to a different port in core-site.xml:

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9099</value>
</property>

Running a simple SQL query spark.sql("select * from archive.tcsv3 limit 100").show(); in spark-shell gives me the error:

ERROR metastore.RetryingHMSHandler: AlreadyExistsException(message:Database default already exists)
.....
From local/147.214.109.160 to localhost:9000 failed on connection exception: java.net.ConnectException: Connection refused;
.....

I got the AlreadyExistsException before as well, and it doesn't seem to affect the result.

I can make it work by creating a new SparkContext:

import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

// Stop the context the shell created at startup, then build a fresh one;
// the new context re-reads the Hadoop configuration.
sc.stop()
var sc = new SparkContext()
val session = SparkSession.builder().master("local").appName("test").enableHiveSupport().getOrCreate()
session.sql("show tables").show()

My question is: why did the initial SparkSession/SparkContext not pick up the correct configuration, and how can I fix it? Thanks!
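For reference, a quick way to check which fs.defaultFS a running session actually sees (a sketch, assuming the default spark-shell session spark):

// Prints the fs.defaultFS held by the session's Hadoop configuration;
// if this still shows hdfs://localhost:9000, the session was built
// before the updated core-site.xml took effect.
spark.sparkContext.hadoopConfiguration.get("fs.defaultFS")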

asked Aug 19 '16 by xubuild

People also ask

Can we access the SparkContext via a SparkSession?

Yes. All functionality available with SparkContext is also available through SparkSession, which additionally provides APIs for working with DataFrames and Datasets.
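For example (a minimal sketch; spark is the session that spark-shell pre-builds):

// The underlying SparkContext is exposed as a field on the session
val sc = spark.sparkContext
sc.parallelize(1 to 5).sum()  // plain RDD API, reached through the session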

How is SparkSession different from SparkContext?

In earlier versions of Spark (and PySpark), SparkContext was the entry point for programming with RDDs and for connecting to the Spark cluster. With the introduction of SparkSession in Spark 2.0, it became the entry point for programming with DataFrames and Datasets.
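A minimal sketch of the Spark 2.x entry point (names and values here are illustrative):

import org.apache.spark.sql.SparkSession

// One builder call replaces the old SparkContext/SQLContext setup
val spark = SparkSession.builder()
  .master("local[*]")      // illustrative; usually set by spark-submit instead
  .appName("entry-point-demo")
  .getOrCreate()

spark.range(10).show()     // DataFrame/Dataset API lives on the session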

What is SparkSession config?

SparkSession encapsulates SparkContext. It allows you to configure Spark configuration parameters, and through SparkContext the driver can access other contexts such as SQLContext, HiveContext, and StreamingContext to program Spark.
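For example, configuration can be passed to the builder and read back at runtime (a sketch; the key and value are illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("config-demo")
  .config("spark.sql.shuffle.partitions", "8")  // any Spark conf key/value
  .getOrCreate()

// Runtime view of the same setting, via the session's conf
spark.conf.get("spark.sql.shuffle.partitions")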


1 Answer

If you are using SparkSession and you want to set configuration on the Spark context, use session.sparkContext:

val session = SparkSession
  .builder()
  .appName("test")
  .enableHiveSupport()
  .getOrCreate()
import session.implicits._

// Hadoop-level settings go on the context's hadoopConfiguration, for example:
session.sparkContext.hadoopConfiguration.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")

You don't need to import SparkContext or create it before the SparkSession.
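Applied to the question's setup, the same approach should let the session reach the NameNode on the non-default port (a sketch, assuming HDFS really is listening on 9099):

// Point the session's Hadoop configuration at the new fs.defaultFS
session.sparkContext.hadoopConfiguration.set("fs.defaultFS", "hdfs://localhost:9099")
session.sql("select * from archive.tcsv3 limit 100").show()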

answered Sep 28 '22 by Jeremy Sanecki