
How to build a sparkSession in Spark 2.0 using pyspark?

I just got access to Spark 2.0; I have been using Spark 1.6.1 up until this point. Can someone please help me set up a SparkSession using pyspark (Python)? I know the Scala examples available online are similar (here), but I was hoping for a direct walkthrough in Python.

My specific case: I am loading Avro files from S3 in a Zeppelin Spark notebook, then building DataFrames and running various pyspark and SQL queries off of them. All of my old queries use sqlContext. I know this is poor practice, but I started my notebook with

sqlContext = SparkSession.builder.enableHiveSupport().getOrCreate()

I can read in the Avro files with

mydata = sqlContext.read.format("com.databricks.spark.avro").load("s3:...

and build DataFrames with no issues. But once I start querying the DataFrames/temp tables, I keep getting a "java.lang.NullPointerException" error. I think that indicates a translation error (e.g., queries that worked in 1.6.1 need to be tweaked for 2.0). The error occurs regardless of query type. So I am assuming

1.) the sqlContext alias is a bad idea

and

2.) I need to properly set up a sparkSession.

So if someone could show me how this is done, or perhaps explain the discrepancies they know of between the different versions of Spark, I would greatly appreciate it. Please let me know if I need to elaborate on this question. I apologize if it is convoluted.

asked Sep 29 '16 by haileyeve


People also ask

What is Spark 2.0 SparkSession?

SparkSession was introduced in Spark 2.0. It is an entry point to underlying Spark functionality for programmatically creating Spark RDDs, DataFrames, and Datasets.
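
For illustration, a minimal sketch of what that entry point looks like in pyspark (the app name and sample data below are placeholders, not from the question):

from pyspark.sql import SparkSession

# One session replaces the separate SQLContext/HiveContext entry points of 1.6
spark = SparkSession.builder.appName("example").getOrCreate()

# The underlying SparkContext is still reachable for RDD work
rdd = spark.sparkContext.parallelize([(1, "a"), (2, "b")])

# DataFrames are created directly from the session
df = spark.createDataFrame(rdd, ["id", "letter"])
df.show()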


2 Answers

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('abc').getOrCreate()

Now, to read in a .csv file, you can use

df = spark.read.csv('filename.csv', header=True)
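
Continuing this answer's snippet, a hedged sketch of querying that df with SQL, which is what the question is trying to do (the view name is a placeholder; createOrReplaceTempView is the Spark 2.0 replacement for registerTempTable):

# Register the DataFrame as a temp view so it can be queried with SQL
df.createOrReplaceTempView("mytable")

# In 2.0, SQL runs through the session itself rather than a sqlContext
result = spark.sql("SELECT COUNT(*) FROM mytable")
result.show()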
answered Oct 04 '22 by Csaxena


As you can see in the Scala example, SparkSession is part of the sql module. It is similar in Python; hence, see the pyspark.sql module documentation:

class pyspark.sql.SparkSession(sparkContext, jsparkSession=None)

The entry point to programming Spark with the Dataset and DataFrame API. A SparkSession can be used to create DataFrames, register DataFrames as tables, execute SQL over tables, cache tables, and read parquet files. To create a SparkSession, use the following builder pattern:

>>> spark = SparkSession.builder \
...     .master("local") \
...     .appName("Word Count") \
...     .config("spark.some.config.option", "some-value") \
...     .getOrCreate()
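
Applied to the question's setup, a hedged sketch (the app name and S3 path are placeholders; enableHiveSupport() mirrors the asker's original call, and com.databricks.spark.avro is the external package the question already uses):

from pyspark.sql import SparkSession

# Hive support matches the asker's original enableHiveSupport() setup
spark = SparkSession.builder \
    .appName("avro-from-s3") \
    .enableHiveSupport() \
    .getOrCreate()

# Reading Avro works as before, but through the session's reader
mydata = spark.read.format("com.databricks.spark.avro") \
    .load("s3://my-bucket/path/to/avro/")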
answered Oct 04 '22 by Ayan Guha