 

SparkSession initialization error - Unable to use spark.read

I tried to create a standalone PySpark program that reads a CSV file and stores it in a Hive table. I am having trouble configuring the SparkSession, SparkConf and SparkContext objects. Here is my code:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, SparkSession
from pyspark.sql.types import *

conf = SparkConf().setAppName("test_import")
sc = SparkContext(conf=conf)
sqlContext  = SQLContext(sc)

spark = SparkSession.builder.config(conf=conf)
dfRaw = spark.read.csv("hdfs:/user/..../test.csv",header=False)

dfRaw.createOrReplaceTempView('tempTable')
sqlContext.sql("create table customer.temp as select * from tempTable")

And I get the error:

dfRaw = spark.read.csv("hdfs:/user/../test.csv",header=False)
AttributeError: 'Builder' object has no attribute 'read'

Which is the right way to configure the SparkSession object in order to use the read.csv command? Also, can someone explain the difference between the Session, Context and Conf objects?

Michail N asked Oct 24 '17


People also ask

What is SparkSession in Spark?

SparkSession encapsulates SparkContext. It allows you to configure Spark configuration parameters, and through SparkContext the driver can access other contexts such as SQLContext, HiveContext, and StreamingContext to program Spark.
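For example, a minimal sketch (the app name and config key are illustrative, not taken from the question) of reaching the wrapped SparkContext through a SparkSession:

from pyspark.sql import SparkSession

# Build a session; configuration parameters are set on the builder
spark = SparkSession.builder \
    .appName("encapsulation_demo") \
    .config("spark.sql.shuffle.partitions", "8") \
    .getOrCreate()

# The session wraps a SparkContext, reachable as an attribute
sc = spark.sparkContext
print(sc.appName)  # prints: encapsulation_demo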

Should I use SparkSession or SparkContext?

As a result, when comparing SparkSession vs SparkContext: as of Spark 2.0.0 it is better to use SparkSession, because it provides access to all of the Spark features that the other three APIs (SparkContext, SQLContext, HiveContext) do.

What is the difference between SQLContext and SparkSession?

In Spark, SparkSession is an entry point to the Spark application and SQLContext is used to process structured data that contains rows and columns. Here, I will mainly focus on explaining the difference between SparkSession and SQLContext by defining and describing how to create these two.

What is SparkContext vs SparkSession?

SparkSession vs SparkContext – In earlier versions of Spark/PySpark, SparkContext (JavaSparkContext for Java) was the entry point for programming with RDDs and for connecting to the Spark cluster. Since Spark 2.0, SparkSession has been the entry point for programming with DataFrames and Datasets.
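As a quick illustration (a sketch with made-up data): the RDD API is reached through the session's SparkContext, while the DataFrame API hangs directly off the session:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("entry_points").getOrCreate()

# RDD programming goes through the SparkContext
rdd = spark.sparkContext.parallelize([1, 2, 3])
print(rdd.sum())  # 6

# DataFrame programming goes through the SparkSession itself
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.show()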


1 Answer

There is no need to use both SparkContext and SparkSession to initialize Spark. SparkSession is the newer, recommended entry point.

To initialize your environment, simply do:

spark = SparkSession\
  .builder\
  .appName("test_import")\
  .getOrCreate()

You can run SQL commands by doing:

spark.sql(...)
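Putting this together, here is a sketch of the question's script rewritten to use only SparkSession. The HDFS path is kept as written in the question; .enableHiveSupport() is an assumption on my part, since the script writes to a Hive table:

from pyspark.sql import SparkSession

spark = SparkSession\
  .builder\
  .appName("test_import")\
  .enableHiveSupport()\
  .getOrCreate()

# getOrCreate() is the step missing in the question: without it,
# spark is still a Builder, which has no 'read' attribute
dfRaw = spark.read.csv("hdfs:/user/..../test.csv", header=False)

dfRaw.createOrReplaceTempView('tempTable')
spark.sql("create table customer.temp as select * from tempTable")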

Prior to Spark 2.0.0, three separate objects were used: SparkContext, SQLContext and HiveContext. These were used separately depending on what you wanted to do and the data types involved.

With the introduction of the Dataset/DataFrame abstractions, the SparkSession object became the main entry point to the Spark environment. It is still possible to access the other objects by first initializing a SparkSession (say, in a variable named spark) and then using spark.sparkContext or spark.sqlContext.
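For instance (a sketch; spark is assumed to be an initialized SparkSession as above):

sc = spark.sparkContext          # the underlying SparkContext
rdd = sc.parallelize(range(10))  # legacy RDD operations still work through it
print(rdd.count())               # 10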

Shaido answered Oct 28 '22