 

What are SparkSession Config Options

I am trying to use SparkSession to read a JSON file and convert its data to an RDD in Spark Notebook. I already have the JSON file.

val spark = SparkSession
  .builder()
  .appName("jsonReaderApp")
  .config("config.key.here", configValueHere)
  .enableHiveSupport()
  .getOrCreate()
val jread = spark.read.json("search-results1.json")

I am very new to Spark and do not know what to use for config.key.here and configValueHere.

asked Mar 26 '17 by Sha2b


People also ask

What is SparkSession config?

SparkSession encapsulates SparkContext. It allows you to configure Spark configuration parameters, and through SparkContext the driver can access other contexts such as SQLContext, HiveContext, and StreamingContext to program Spark.
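
For example, a minimal PySpark sketch (not from the original answer) showing that the SparkContext wrapped by a SparkSession is directly reachable:

from pyspark.sql import SparkSession

# Build (or reuse) a session; the wrapped SparkContext comes with it.
spark = SparkSession.builder.appName("encapsulation-demo").getOrCreate()

sc = spark.sparkContext                   # the encapsulated SparkContext
print(sc.appName)                         # application name seen by the context
print(spark.conf.get("spark.app.name"))   # the same value via runtime config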

What is SparkSession used for?

The SparkSession is used to create and read DataFrames. It's used whenever you create a DataFrame in your test suite or whenever you read a Parquet / CSV data lake into a DataFrame. This post explains how to create a SparkSession, share it throughout your program, and use it to create DataFrames.
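
As a quick illustration, a hedged PySpark sketch (the Parquet path is hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Create a DataFrame in code ...
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
df.show()

# ... or read one from storage (hypothetical path).
# parquet_df = spark.read.parquet("/data/lake/events.parquet")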

What is the difference between SparkConf and SparkSession?

SparkSession vs SparkContext: in earlier versions of Spark and PySpark, SparkContext (JavaSparkContext for Java) was the entry point to Spark programming with RDDs and for connecting to the Spark cluster. Since Spark 2.0, SparkSession has been the entry point for programming with DataFrames and Datasets.
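
A small sketch of the two entry points side by side (assuming a local PySpark setup):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# RDD API: entered through SparkContext (the pre-2.0 entry point).
rdd = sc.parallelize([1, 2, 3])
print(rdd.sum())

# DataFrame/Dataset API: entered through SparkSession (Spark 2.0+).
spark.range(3).show()   # DataFrame with a single 'id' column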

What is appName in SparkSession?

appName(name) sets a name for the application, which will be shown in the Spark web UI. If no application name is set, a randomly generated name is used. New in version 2.0.
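
For instance (a minimal sketch; the name itself is arbitrary):

from pyspark.sql import SparkSession

# Note: appName only takes effect if this call actually creates the session;
# getOrCreate() reuses an existing session, name included.
spark = SparkSession.builder.appName("my-web-ui-name").getOrCreate()
print(spark.sparkContext.appName)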


3 Answers

SparkSession

To get all the "various Spark parameters as key-value pairs" for a SparkSession, "the entry point to programming Spark with the Dataset and DataFrame API," run the following (this uses the Spark Python API; Scala would be very similar).

from pyspark import SparkConf
from pyspark.sql import SparkSession

# Get (or create) the active session, then list the configured parameters.
spark = SparkSession.builder.getOrCreate()
SparkConf().getAll()

or without importing SparkConf:

spark.sparkContext.getConf().getAll()

Depending on which API you are using, see one of the following:

  1. https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/SparkSession.html
  2. https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/spark_session.html
  3. https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/SparkSession.html

You can get a deeper-level list of SparkSession configuration options by running the code below. Most entries are the same, but there are a few extra ones. I am not sure whether you can change these.

spark.sparkContext._conf.getAll()  
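
If you want to see exactly which extra keys that deeper view reports, here is a small sketch that diffs the two (note that _conf is a private attribute and may change between Spark versions):

# Compare the public and the private views of the configuration.
public_keys = {k for k, _ in spark.sparkContext.getConf().getAll()}
deep_keys = {k for k, _ in spark.sparkContext._conf.getAll()}
print(sorted(deep_keys - public_keys))   # keys only the deeper view has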

SparkContext

To get all the "various Spark parameters as key-value pairs" for a SparkContext, the "main entry point for Spark functionality," ... "connection to a Spark cluster," ... used "to create RDDs, accumulators and broadcast variables on that cluster," run the following.

from pyspark import SparkConf, SparkContext

# Build a configuration, start a context from it, then list its parameters.
spark_conf = SparkConf().setAppName("test")
sc = SparkContext(conf=spark_conf)
sc.getConf().getAll()

Depending on which API you are using, see one of the following:

  1. https://spark.apache.org/docs/latest/api/scala/org/apache/spark/SparkContext.html
  2. https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.SparkContext.html
  3. https://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkContext.html

Spark parameters

You should get a list of tuples that contain the "various Spark parameters as key-value pairs" similar to the following:

[(u'spark.eventLog.enabled', u'true'),
 (u'spark.yarn.appMasterEnv.PYSPARK_PYTHON', u'/<yourpath>/parcels/Anaconda-4.2.0/bin/python'),
 ...
 ...
 (u'spark.yarn.jars', u'local:/<yourpath>/lib/spark2/jars/*')]

Depending on which API you are using, see one of the following:

  1. https://spark.apache.org/docs/latest/api/scala/org/apache/spark/SparkConf.html
  2. https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.SparkConf.html
  3. https://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkConf.html

For a complete list of Spark properties, see:
http://spark.apache.org/docs/latest/configuration.html#viewing-spark-properties
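
Relatedly (not from the original answer), Spark SQL can describe its own configuration properties at runtime; a minimal sketch, assuming an active session:

# 'SET -v' lists Spark SQL properties with current values and descriptions.
spark.sql("SET -v").show(n=5, truncate=False)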

Setting Spark parameters

Each tuple is ("spark.some.config.option", "some-value"), which you can set in your application with:

SparkSession

spark = (
    SparkSession
    .builder
    .appName("Your App Name")
    .config("spark.some.config.option1", "some-value")
    .config("spark.some.config.option2", "some-value")
    .getOrCreate())

sc = spark.sparkContext
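
You can read a value back at runtime to confirm it took effect (using the placeholder key from above):

print(spark.conf.get("spark.some.config.option1"))   # -> 'some-value'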

SparkContext

spark_conf = (
    SparkConf()
    .setAppName("Your App Name")
    .set("spark.some.config.option1", "some-value")
    .set("spark.some.config.option2", "some-value"))

sc = SparkContext(conf=spark_conf)

spark-defaults

You can also set the Spark parameters in a spark-defaults.conf file:

spark.some.config.option1 some-value
spark.some.config.option2 "some-value"

then run your Spark application with spark-submit (pyspark):

spark-submit \
--properties-file path/to/your/spark-defaults.conf \
--name "Your App Name" \
--py-files path/to/your/supporting/pyspark_files.zip \
path/to/your/pyspark_main.py
answered Oct 11 '22 by Clay


This is how I added Spark and Hive settings in my Scala code:

val spark = SparkSession
    .builder()
    .appName("StructStreaming")
    .master("yarn")
    .config("hive.merge.mapfiles", "false")
    .config("hive.merge.tezfiles", "false")
    .config("parquet.enable.summary-metadata", "false")
    .config("spark.sql.parquet.mergeSchema", "false")
    .config("hive.merge.smallfiles.avgsize", "160000000")
    .enableHiveSupport()
    .config("hive.exec.dynamic.partition", "true")
    .config("hive.exec.dynamic.partition.mode", "nonstrict")
    .config("spark.sql.orc.impl", "native")
    .config("spark.sql.parquet.binaryAsString", "true")
    .config("spark.sql.parquet.writeLegacyFormat", "true")
    //.config("spark.sql.streaming.checkpointLocation", "hdfs://pp/apps/hive/warehouse/dev01_landing_initial_area.db")
    .getOrCreate()
answered Oct 11 '22 by Jeff A.


The easiest way to set some config:

spark.conf.set("spark.sql.shuffle.partitions", 500)

Here spark refers to a SparkSession; that way you can set configs at runtime. This is really useful when you want to change configs repeatedly to tune Spark parameters for specific queries.
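
A short sketch of that pattern (the DataFrames are hypothetical; only spark is assumed to exist):

# Tune shuffle parallelism for one heavy query, then restore the old value.
default_partitions = spark.conf.get("spark.sql.shuffle.partitions")

spark.conf.set("spark.sql.shuffle.partitions", 500)
# big_df.join(other_df, "id").count()   # hypothetical heavy query

spark.conf.set("spark.sql.shuffle.partitions", default_partitions)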

answered Oct 11 '22 by kar09