I tried to create a standalone PySpark program that reads a CSV and stores it in a Hive table. I am having trouble configuring the Spark session, conf, and context objects. Here is my code:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, SparkSession
from pyspark.sql.types import *
conf = SparkConf().setAppName("test_import")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
spark = SparkSession.builder.config(conf=conf)
dfRaw = spark.read.csv("hdfs:/user/..../test.csv",header=False)
dfRaw.createOrReplaceTempView('tempTable')
sqlContext.sql("create table customer.temp as select * from tempTable")
And I get the error:
dfRaw = spark.read.csv("hdfs:/user/../test.csv",header=False)
AttributeError: 'Builder' object has no attribute 'read'
What is the right way to configure the SparkSession object in order to use the read.csv command? Also, can someone explain the difference between the Session, Context and Conf objects?
SparkSession encapsulates SparkContext. It allows you to configure Spark configuration parameters, and through SparkContext the driver can access the other contexts, such as SQLContext, HiveContext, and StreamingContext, to program Spark.
As a result, when comparing SparkSession vs SparkContext as of Spark 2.0.0, it is better to use SparkSession because it provides access to all of the Spark features that the other three APIs do.
In Spark, SparkSession is the entry point to the Spark application, while SQLContext is used to process structured data that contains rows and columns. Here, I will mainly focus on explaining the difference between SparkSession and SQLContext by defining them and describing how to create each.
SparkSession vs SparkContext: in earlier versions of Spark and PySpark, SparkContext (JavaSparkContext for Java) was the entry point for programming with RDDs and for connecting to the Spark cluster. Since Spark 2.0, SparkSession has been introduced and has become the entry point for programming with DataFrames and Datasets.
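As a rough side-by-side sketch (the file name and app names here are made up for illustration), the two styles look like this:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, SparkSession

# Spark 1.x style: build a SparkConf, then a SparkContext, then a SQLContext
conf = SparkConf().setAppName("legacy_style")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
df_old = sqlContext.read.csv("some_file.csv", header=False)

# Spark 2.0+ style: a single SparkSession covers all of the above
spark = SparkSession.builder.appName("unified_style").getOrCreate()
df_new = spark.read.csv("some_file.csv", header=False)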
There is no need to use both SparkContext and SparkSession to initialize Spark. SparkSession is the newer, recommended entry point. To initialize your environment, simply do:
spark = SparkSession\
    .builder\
    .appName("test_import")\
    .getOrCreate()
You can run SQL commands by doing:
spark.sql(...)
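Putting this together for your example, a sketch of the whole program could look like the following. The HDFS path and the customer.temp table come from your snippet; enableHiveSupport() is my assumption, since creating a Hive table normally requires the session to be built with Hive support (and a working Hive metastore configuration):

from pyspark.sql import SparkSession

# A single SparkSession is the entry point; enableHiveSupport() lets
# tables be created in the Hive metastore (assumes Hive is configured)
spark = SparkSession\
    .builder\
    .appName("test_import")\
    .enableHiveSupport()\
    .getOrCreate()

# read.csv is available directly on the session -- no SQLContext needed
dfRaw = spark.read.csv("hdfs:/user/..../test.csv", header=False)

dfRaw.createOrReplaceTempView('tempTable')
spark.sql("create table customer.temp as select * from tempTable")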
Prior to Spark 2.0.0, three separate objects were used: SparkContext, SQLContext and HiveContext. These were used separately depending on what you wanted to do and the data types involved.
With the introduction of the Dataset/DataFrame abstractions, the SparkSession object became the main entry point to the Spark environment. It's still possible to access the other objects by first initializing a SparkSession (say, in a variable named spark) and then using spark.sparkContext / spark.sqlContext.
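For example, after creating a session as above, the underlying SparkContext is available as an attribute (the print line is just an illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("test_import").getOrCreate()

sc = spark.sparkContext            # the underlying SparkContext
print(sc.appName, spark.version)   # both describe the same running application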