I am trying to run a Python word count on a Spark HDInsight cluster, and I'm running it from Jupyter. I'm not actually sure if this is the right way to do it, but I couldn't find anything helpful about how to submit a standalone Python app on an HDInsight Spark cluster.
The code:
import atexit
from operator import add
from pyspark import SparkConf
from pyspark import SparkContext

# Create a context explicitly (this is what later conflicts with Jupyter's own context)
conf = SparkConf().setMaster("yarn-client").setAppName("WC")
sc = SparkContext(conf=conf)
atexit.register(lambda: sc.stop())  # make sure the context is stopped on exit

# Plain word count over the sample text that ships with the cluster
input = sc.textFile("wasb:///example/data/gutenberg/davinci.txt")
words = input.flatMap(lambda x: x.split())
wordCount = words.map(lambda x: (str(x), 1)).reduceByKey(add)
wordCount.saveAsTextFile("wasb:///example/outputspark")
And the error message I get and don't understand:
ValueError Traceback (most recent call last)
<ipython-input-2-8a9d4f2cb5e8> in <module>()
6 from operator import add
7 import atexit
----> 8 sc = SparkContext('yarn-client')
9
10 input = sc.textFile("wasb:///example/data/gutenberg/davinci.txt")
/usr/hdp/current/spark-client/python/pyspark/context.pyc in __init__(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, gateway, jsc, profiler_cls)
108 """
109 self._callsite = first_spark_call() or CallSite(None, None, None)
--> 110 SparkContext._ensure_initialized(self, gateway=gateway)
111 try:
112 self._do_init(master, appName, sparkHome, pyFiles, environment, batchSize, serializer,
/usr/hdp/current/spark-client/python/pyspark/context.pyc in _ensure_initialized(cls, instance, gateway)
248 " created by %s at %s:%s "
249 % (currentAppName, currentMaster,
--> 250 callsite.function, callsite.file, callsite.linenum))
251 else:
252 SparkContext._active_spark_context = instance
ValueError: Cannot run multiple SparkContexts at once; existing SparkContext(app=pyspark-shell, master=yarn-client) created by __init__ at <ipython-input-1-86beedbc8a46>:7
Is it actually possible to run a Python job this way? If so, it seems to be a problem with the SparkContext definition... I tried different variants:
sc = SparkContext('spark://headnodehost:7077', 'pyspark')
and
conf = SparkConf().setMaster("yarn-client").setAppName("WordCount1")
sc = SparkContext(conf = conf)
but with no success. What would be the right way to run the job or to configure the SparkContext?
If you are running from a Jupyter notebook, the Spark context is pre-created for you, and it would be incorrect to create a separate one. To resolve the problem, just remove the lines that create the context and start directly from:
input = sc.textFile("wasb:///example/data/gutenberg/davinci.txt")
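In other words, the whole notebook cell shrinks to something like this (a minimal sketch reusing the paths from the question; sc already exists, so there is no SparkConf/SparkContext setup):

from operator import add

# sc is pre-created by the Jupyter kernel on HDInsight; do not create another one
input = sc.textFile("wasb:///example/data/gutenberg/davinci.txt")
words = input.flatMap(lambda x: x.split())
wordCount = words.map(lambda x: (str(x), 1)).reduceByKey(add)
wordCount.saveAsTextFile("wasb:///example/outputspark")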
If you need to run a standalone program, you can run it from the command line using spark-submit, or submit it through the REST API of the Livy server running on the cluster.
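For the Livy route, here is a sketch in Python using the requests library; the cluster name, credentials, and the wasb:// path of the uploaded script are placeholders you would replace with your own:

import json
import requests

# POST a batch job to the Livy endpoint that HDInsight exposes
livy_url = "https://<clustername>.azurehdinsight.net/livy/batches"
payload = {"file": "wasb:///example/app/wordcount.py"}  # your uploaded script

resp = requests.post(
    livy_url,
    data=json.dumps(payload),
    headers={"Content-Type": "application/json"},
    auth=("admin", "<cluster-login-password>"),  # HDInsight cluster credentials
)
print(resp.json())  # returns the batch id and its current state

You can then poll /livy/batches/<id> with a GET request to track the job's state.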