I just got access to Spark 2.0; I have been using Spark 1.6.1 up until this point. Can someone please help me set up a SparkSession using PySpark (Python)? I know the Scala examples available online are similar, but I was hoping for a direct walkthrough in Python.
My specific case: I am loading Avro files from S3 in a Zeppelin Spark notebook, then building DataFrames and running various PySpark and SQL queries against them. All of my old queries use sqlContext. I know this is poor practice, but I started my notebook with
sqlContext = SparkSession.builder.enableHiveSupport().getOrCreate()
I can read in the Avro files with
mydata = sqlContext.read.format("com.databricks.spark.avro").load("s3:...
and build DataFrames with no issues. But once I start querying the DataFrames/temp tables, I keep getting a java.lang.NullPointerException error. I think that indicates a migration issue (i.e. queries that worked in 1.6.1 need to be tweaked for 2.0). The error occurs regardless of the query type. So I am assuming
1) the sqlContext alias is a bad idea, and
2) I need to properly set up a SparkSession.
So if someone could show me how this is done, or explain the discrepancies they know of between the different versions of Spark, I would greatly appreciate it. Please let me know if I need to elaborate on this question; I apologize if it is convoluted.
SparkSession was introduced in Spark 2.0. It is the entry point to underlying Spark functionality and lets you programmatically create RDDs, DataFrames, and Datasets.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('abc').getOrCreate()
Now, to read a .csv file you can use
df = spark.read.csv('filename.csv', header=True)
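Since the question is about running SQL over temp tables, here is a minimal sketch of how that looks with a SparkSession in 2.0 (the view name 'people' is just a placeholder for whatever DataFrame you registered):
df.createOrReplaceTempView('people')   # 2.0 replacement for registerTempTable
spark.sql('SELECT * FROM people').show()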
As you can see in the Scala example, SparkSession is part of the sql module. The same is true in Python; see the pyspark.sql module documentation:
class pyspark.sql.SparkSession(sparkContext, jsparkSession=None)
The entry point to programming Spark with the Dataset and DataFrame API. A SparkSession can be used to create DataFrames, register DataFrames as tables, execute SQL over tables, cache tables, and read parquet files. To create a SparkSession, use the following builder pattern:
>>> spark = SparkSession.builder \
...     .master("local") \
...     .appName("Word Count") \
...     .config("spark.some.config.option", "some-value") \
...     .getOrCreate()
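For the original Avro-from-S3 workflow, a rough sketch along the same lines (the S3 path, app name, and view name below are placeholders, and it assumes the com.databricks.spark.avro package is still on the classpath as in 1.6.1):
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName('avro-migration') \
    .enableHiveSupport() \
    .getOrCreate()

# read the Avro files as before, but through the session's reader
mydata = spark.read.format('com.databricks.spark.avro').load('s3://some-bucket/some-path/')

# register a temp view and run SQL through the session instead of sqlContext
mydata.createOrReplaceTempView('mydata')
spark.sql('SELECT COUNT(*) FROM mydata').show()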