
How to run a Python script as a Spark job?

I have set up Spark on 3 machines using the tar-file method. I have not done any advanced configuration; I only edited the slaves file and started the master and workers (the start sequence is sketched below). I can see the Spark UI on port 8080. Now I want to run the simple Python script below on the Spark cluster.
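For reference, the start sequence was roughly the standard one for a standalone cluster (a sketch assuming the default tar-file layout; actual paths may differ):

# conf/slaves on the master lists one worker hostname per line.
# The stock scripts then bring the cluster up:
./sbin/start-master.sh    # starts the master; its web UI listens on port 8080
./sbin/start-slaves.sh    # starts a worker on every host listed in conf/slaves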

import sys
from random import random
from operator import add

from pyspark import SparkContext


if __name__ == "__main__":
    """
        Usage: pi [partitions]
    """
    sc = SparkContext(appName="PythonPi")
    partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
    n = 100000 * partitions

    def f(_):
        # Draw a random point in the square [-1, 1] x [-1, 1] and report
        # whether it lands inside the unit circle.
        x = random() * 2 - 1
        y = random() * 2 - 1
        return 1 if x ** 2 + y ** 2 < 1 else 0

    # Spread n samples over the requested number of partitions, then sum
    # the hits; the hit ratio approximates pi / 4.
    count = sc.parallelize(xrange(1, n + 1), partitions).map(f).reduce(add)
    print "Pi is roughly %f" % (4.0 * count / n)

    sc.stop()

I am running this command:

spark-submit --master spark://IP:7077 pi.py 1

But I am getting the following error:

14/12/22 18:31:23 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
14/12/22 18:31:38 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
14/12/22 18:31:43 INFO client.AppClient$ClientActor: Connecting to master spark://10.77.36.243:7077...
14/12/22 18:31:53 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
14/12/22 18:32:03 INFO client.AppClient$ClientActor: Connecting to master spark://10.77.36.243:7077...
14/12/22 18:32:08 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
14/12/22 18:32:23 ERROR cluster.SparkDeploySchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up.
14/12/22 18:32:23 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
14/12/22 18:32:23 INFO scheduler.TaskSchedulerImpl: Cancelling stage 0
14/12/22 18:32:23 INFO scheduler.DAGScheduler: Failed to run reduce at /opt/pi.py:21
Traceback (most recent call last):
  File "/opt/pi.py", line 21, in <module>
    count = sc.parallelize(xrange(1, n + 1), partitions).map(f).reduce(add)
  File "/usr/local/spark/python/pyspark/rdd.py", line 759, in reduce
    vals = self.mapPartitions(func).collect()
  File "/usr/local/spark/python/pyspark/rdd.py", line 723, in collect
    bytesInJava = self._jrdd.collect().iterator()
  File "/usr/local/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/usr/local/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o26.collect.
: org.apache.spark.SparkException: Job aborted due to stage failure: All masters are unresponsive! Giving up.
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)

Is anyone else facing the same issue? Please help with this.

asked Dec 22 '14 by sandip divekar



1 Answer

This:

WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory

suggests that the cluster is not offering your application any resources: either no workers are registered with this master, or the registered workers do not have enough free cores or memory.
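If the workers are registered but short on resources, one generic thing to try (a sketch, not verified against this particular cluster) is to explicitly request less than the workers advertise:

spark-submit --master spark://IP:7077 \
  --executor-memory 512m \
  --total-executor-cores 2 \
  pi.py 1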

Check the status of your cluster in the master's web UI and inspect the cores and RAM available to each worker (http://www.datastax.com/dev/blog/common-spark-troubleshooting).
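Independently of the cluster's state, running the script in Spark's built-in local mode is a quick way to confirm that the script itself is fine, since it bypasses the standalone master entirely:

spark-submit --master local[2] pi.py 1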

Also, double-check your IP addresses: the master URL you pass to spark-submit must match exactly what the master's web UI shows at the top (e.g. spark://10.77.36.243:7077), as Spark standalone is strict about hostname versus IP.
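One way to pin this down (a sketch with hypothetical values; substitute your own master host) is to set the bind address explicitly in conf/spark-env.sh on the master and restart it:

# conf/spark-env.sh (hypothetical values)
export SPARK_MASTER_IP=10.77.36.243   # address the master binds to and advertises
export SPARK_MASTER_PORT=7077         # default standalone master port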

For more ideas, see: Running a Job on Spark 0.9.0 throws error

answered Oct 05 '22 by Def_Os