 

Setting SparkContext for PySpark

I am a newbie with Spark and PySpark. I would appreciate it if somebody could explain what exactly the SparkContext parameter does, and how I can set spark_context for a Python application.

asked Jul 28 '14 by Dalek

People also ask

How do you use SparkContext in PySpark?

SparkContext is the entry point to any Spark functionality. When you run a Spark application, a driver program starts; it contains the main function, and the SparkContext is initialized there. The driver program then runs the operations inside executors on worker nodes.

How do you get SparkContext PySpark?

In Spark/PySpark you can get the current active SparkContext and its configuration settings by accessing spark.sparkContext.getConf().getAll(), where spark is a SparkSession object; getAll() returns the full set of key/value configuration pairs (an Array[(String, String)] in Scala, a list of tuples in PySpark).
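
In PySpark that looks roughly like the sketch below (assuming Spark 2.x or later, where SparkSession is available; the master URL and app name are placeholders, not values from the question):

from pyspark.sql import SparkSession

# build (or reuse) a session; "local[2]" and the app name are placeholders
spark = (SparkSession.builder
         .master("local[2]")
         .appName("config-inspection")
         .getOrCreate())

# getConf().getAll() returns a list of (key, value) tuples in PySpark
for key, value in spark.sparkContext.getConf().getAll():
    print(key, "=", value)

spark.stop()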

What do you mean by PySpark SparkContext?

A SparkContext represents the connection to a Spark cluster, and can be used to create RDDs and broadcast variables on that cluster. When you create a new SparkContext, at least the master and app name should be set, either through the named parameters or through a conf object.
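
As a minimal sketch of those two required settings, using named parameters (the master URL and app name below are arbitrary examples):

from pyspark import SparkContext

# master and app name passed as named parameters; both values are placeholders
sc = SparkContext(master="local[2]", appName="my-first-app")

print(sc.master)     # local[2]
print(sc.appName)    # my-first-app

sc.stop()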


2 Answers

The spark_context represents your interface to a running Spark cluster manager. In other words, you will have already defined one or more running environments for Spark (see the installation/initialization docs), detailing the nodes to run on, etc. You start a spark_context object with a configuration that tells it which environment to use and, for example, the application name. All further interaction, such as loading data, happens as methods of the context object.

For simple examples and testing, you can run the Spark cluster "locally" and skip much of the detail above, e.g.,

./bin/pyspark --master local[4]

will start an interpreter with a context already set to use four threads on your own CPU.

In a standalone app, to be run with spark-submit:

from pyspark import SparkContext
sc = SparkContext("local", "Simple App")
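
Once the context exists, data loading and other interaction go through its methods. A hedged continuation of the snippet above might look like this (it reuses the sc created there; the text-file path is hypothetical):

# distribute a small in-memory collection as an RDD through the context
numbers = sc.parallelize([1, 2, 3, 4])
print(numbers.map(lambda x: x * x).collect())   # [1, 4, 9, 16]

# files are also read through the context; the path here is a placeholder
# lines = sc.textFile("data/input.txt")

sc.stop()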
answered Oct 25 '22 by mdurant

The first thing a Spark program must do is to create a SparkContext object, which tells Spark how to access a cluster. To create a SparkContext you first need to build a SparkConf object that contains information about your application.

If you are running the pyspark shell, Spark automatically creates the SparkContext object for you under the name sc. But if you are writing your own Python program, you have to do something like:

from pyspark import SparkContext
sc = SparkContext(appName="test")

Any configuration goes into this SparkContext object, such as setting the executor memory or the number of cores.
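
For instance, a SparkConf can carry those settings before the context is created; this is only a sketch, and the specific values (master URL, memory, cores) are placeholders rather than recommendations:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("test")
        .setMaster("local[4]")               # placeholder master URL
        .set("spark.executor.memory", "2g")  # memory per executor
        .set("spark.executor.cores", "1"))   # cores per executor

sc = SparkContext(conf=conf)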

These parameters can also be passed from the shell when invoking spark-submit, for example:

./bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master yarn-cluster \
--num-executors 3 \
--driver-memory 4g \
--executor-memory 2g \
--executor-cores 1 \
lib/spark-examples*.jar \
10

To pass parameters to pyspark, use something like this:

./bin/pyspark --num-executors 17 --executor-cores 5 --executor-memory 8G
answered Oct 25 '22 by iec2011007