I just got access to Spark 2.0; I have been using Spark 1.6.1 up until this point. Can someone please help me set up a SparkSession using PySpark (Python)? I know the Scala examples available online are similar, but I was hoping for a direct walkthrough in Python.
My specific case: I am loading Avro files from S3 in a Zeppelin Spark notebook, then building DataFrames and running various PySpark and SQL queries against them. All of my old queries use sqlContext. I know this is poor practice, but I started my notebook with
sqlContext = SparkSession.builder.enableHiveSupport().getOrCreate()
I can read in the Avro files with
mydata = sqlContext.read.format("com.databricks.spark.avro").load("s3:...
and build DataFrames with no issues. But once I start querying the DataFrames/temp tables, I keep getting a java.lang.NullPointerException error. I think that indicates a migration issue (i.e. queries that worked in 1.6.1 need to be tweaked for 2.0). The error occurs regardless of the query type. So I am assuming
1) the sqlContext alias is a bad idea, and
2) I need to properly set up a SparkSession.
So if someone could show me how this is done, or explain the discrepancies they know of between the different versions of Spark, I would greatly appreciate it. Please let me know if I need to elaborate on this question; I apologize if it is convoluted.
SparkSession was introduced in Spark 2.0. It is the entry point to underlying Spark functionality and lets you programmatically create RDDs, DataFrames, and Datasets.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('abc').getOrCreate()
Now, to read a .csv file you can use
df = spark.read.csv('filename.csv', header=True)
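Since the question is about running SQL over temp tables, here is a minimal sketch of how that looks with a SparkSession in 2.0 (the view name 'people' is just a placeholder for whatever DataFrame you registered):
df.createOrReplaceTempView('people')   # 2.0 replacement for registerTempTable
spark.sql('SELECT * FROM people').show()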
As you can see in the Scala example, SparkSession is part of the sql module. The same is true in Python; see the pyspark.sql module documentation:
class pyspark.sql.SparkSession(sparkContext, jsparkSession=None)
The entry point to programming Spark with the Dataset and DataFrame API. A SparkSession can be used to create DataFrames, register DataFrames as tables, execute SQL over tables, cache tables, and read parquet files. To create a SparkSession, use the following builder pattern:
>>> spark = SparkSession.builder \
...     .master("local") \
...     .appName("Word Count") \
...     .config("spark.some.config.option", "some-value") \
...     .getOrCreate()
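For the original Avro-from-S3 workflow, a rough sketch along the same lines (the S3 path, app name, and view name below are placeholders, and it assumes the com.databricks.spark.avro package is still on the classpath as in 1.6.1):
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName('avro-migration') \
    .enableHiveSupport() \
    .getOrCreate()

# read the Avro files as before, but through the session's reader
mydata = spark.read.format('com.databricks.spark.avro').load('s3://some-bucket/some-path/')

# register a temp view and run SQL through the session instead of sqlContext
mydata.createOrReplaceTempView('mydata')
spark.sql('SELECT COUNT(*) FROM mydata').show()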