 

Setting SparkContext for PySpark

I am a newbie with Spark and PySpark. I would appreciate it if somebody could explain what exactly the SparkContext parameter does, and how I can set spark_context for a Python application.

asked Jul 28 '14 by Dalek

People also ask

How do you use SparkContext in PySpark?

SparkContext is the entry point to any Spark functionality. When you run a Spark application, a driver program starts; it contains the main function, and the SparkContext is initialized there. The driver program then runs the operations inside executors on worker nodes.

How do you get SparkContext PySpark?

In Spark/PySpark you can get the current active SparkContext and its configuration settings by accessing spark.sparkContext.getConf().getAll(), where spark is a SparkSession object; getAll() returns the full set of key/value configuration pairs (an Array[(String, String)] in Scala, a list of tuples in PySpark).
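
In PySpark that looks roughly like the sketch below (assuming Spark 2.x or later, where SparkSession is available; the master URL and app name are placeholders, not values from the question):

from pyspark.sql import SparkSession

# build (or reuse) a session; "local[2]" and the app name are placeholders
spark = (SparkSession.builder
         .master("local[2]")
         .appName("config-inspection")
         .getOrCreate())

# getConf().getAll() returns a list of (key, value) tuples in PySpark
for key, value in spark.sparkContext.getConf().getAll():
    print(key, "=", value)

spark.stop()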

What do you mean by PySpark SparkContext?

A SparkContext represents the connection to a Spark cluster, and can be used to create RDDs and broadcast variables on that cluster. When you create a new SparkContext, at least the master and app name should be set, either through the named parameters or through a conf object.
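
As a minimal sketch of those two required settings, using named parameters (the master URL and app name below are arbitrary examples):

from pyspark import SparkContext

# master and app name passed as named parameters; both values are placeholders
sc = SparkContext(master="local[2]", appName="my-first-app")

print(sc.master)     # local[2]
print(sc.appName)    # my-first-app

sc.stop()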


2 Answers

The spark_context represents your interface to a running Spark cluster manager. In other words, you will have already defined one or more running environments for Spark (see the installation/initialization docs), detailing the nodes to run on, etc. You start a spark_context object with a configuration that tells it which environment to use and, for example, the application name. All further interaction, such as loading data, happens as methods of the context object.

For simple examples and testing, you can run the Spark cluster "locally" and skip much of the detail above, e.g.,

./bin/pyspark --master local[4]

will start an interpreter with a context already set to use four threads on your own CPU.

In a standalone app, to be run with spark-submit:

from pyspark import SparkContext
sc = SparkContext("local", "Simple App")
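
Once the context exists, data loading and other interaction go through its methods. A hedged continuation of the snippet above might look like this (it reuses the sc created there; the text-file path is hypothetical):

# distribute a small in-memory collection as an RDD through the context
numbers = sc.parallelize([1, 2, 3, 4])
print(numbers.map(lambda x: x * x).collect())   # [1, 4, 9, 16]

# files are also read through the context; the path here is a placeholder
# lines = sc.textFile("data/input.txt")

sc.stop()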
answered Oct 25 '22 by mdurant

The first thing a Spark program must do is to create a SparkContext object, which tells Spark how to access a cluster. To create a SparkContext you first need to build a SparkConf object that contains information about your application.

If you are running the pyspark shell, Spark automatically creates the SparkContext object for you under the name sc. But if you are writing your own Python program, you have to do something like:

from pyspark import SparkContext
sc = SparkContext(appName="test")

Any configuration goes into this SparkContext object, such as setting the executor memory or the number of cores.
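
For instance, a SparkConf can carry those settings before the context is created; this is only a sketch, and the specific values (master URL, memory, cores) are placeholders rather than recommendations:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("test")
        .setMaster("local[4]")               # placeholder master URL
        .set("spark.executor.memory", "2g")  # memory per executor
        .set("spark.executor.cores", "1"))   # cores per executor

sc = SparkContext(conf=conf)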

These parameters can also be passed from the shell when invoking spark-submit, for example:

./bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master yarn-cluster \
--num-executors 3 \
--driver-memory 4g \
--executor-memory 2g \
--executor-cores 1 \
lib/spark-examples*.jar \
10

To pass parameters to pyspark, use something like this:

./bin/pyspark --num-executors 17 --executor-cores 5 --executor-memory 8G
answered Oct 25 '22 by iec2011007