
What is the difference between spark-submit and pyspark?


If I start up pyspark and then run this command:

import my_script; spark = my_script.Sparker(sc); spark.collapse('./data/')

Everything is A-ok. If, however, I try to do the same thing from the command line with spark-submit, I get an error:

Command: /usr/local/spark/bin/spark-submit my_script.py collapse ./data/
  File "/usr/local/spark/python/pyspark/rdd.py", line 352, in func
    return f(iterator)
  File "/usr/local/spark/python/pyspark/rdd.py", line 1576, in combineLocally
    merger.mergeValues(iterator)
  File "/usr/local/spark/python/pyspark/shuffle.py", line 245, in mergeValues
    for k, v in iterator:
  File "/.../my_script.py", line 173, in _json_args_to_arr
    js = cls._json(line)
RuntimeError: uninitialized staticmethod object

my_script:

...
if __name__ == "__main__":
    args = sys.argv[1:]
    if args[0] == 'collapse':
        directory = args[1]
        from pyspark import SparkContext
        sc = SparkContext(appName="Collapse")
        spark = Sparker(sc)
        spark.collapse(directory)
        sc.stop()

Why is this happening? What's the difference between running pyspark and running spark-submit that would cause this divergence? And how can I make this work in spark-submit?

EDIT: I tried running this from the bash shell by doing pyspark my_script.py collapse ./data/ and I got the same error. The only time when everything works is when I am in a python shell and import the script.

user592419 asked Nov 04 '14 02:11



People also ask

What is Spark submit PySpark?

The spark-submit command is a utility for running or submitting a Spark or PySpark application (or job) to a cluster with the options and configuration you specify. The application you submit can be written in Scala, Java, or Python (PySpark).

What is Spark submit used for?

The spark-submit script in Spark's bin directory is used to launch applications on a cluster. It can use all of Spark's supported cluster managers through a uniform interface so you don't have to configure your application especially for each one.

Which is better Spark or PySpark?

Spark is an awesome framework and the Scala and Python APIs are both great for most workflows. PySpark is more popular because Python is the most popular language in the data community. PySpark is a well supported, first class Spark API, and is a great choice for most organizations.


2 Answers

  1. If you have built a Spark application, use spark-submit to run it.

    • The code can be written in either Python or Scala.

    • It can run in either local or cluster mode.

  2. If you just want to test or run a few individual commands, use the interactive shell that Spark provides:

    • pyspark (for Spark in Python)
    • spark-shell (for Spark in Scala)
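
As a rough illustration of the first case, here is one way the entry point of a submitted script could be laid out. In the pyspark shell the context sc already exists (the question uses it without creating it), whereas a script run through spark-submit has to create and stop its own. The names parse_args, spark_context and run are hypothetical, and make_job stands in for the question's Sparker class, whose body is not shown; treat this as a sketch under those assumptions, not the asker's actual code.

```python
def parse_args(argv):
    """Split ['collapse', './data/'] into (command, directory)."""
    if len(argv) != 2:
        raise ValueError("usage: my_script.py collapse <directory>")
    command, directory = argv
    if command != "collapse":
        raise ValueError("unknown command: %s" % command)
    return command, directory


def spark_context():
    """Create the context a submitted script needs (the shell provides `sc`)."""
    # Imported lazily so the module can be imported without pyspark loaded.
    from pyspark import SparkContext
    return SparkContext(appName="Collapse")


def run(argv, make_context, make_job):
    """Parse argv, build a context, run the job, and always stop the context."""
    _, directory = parse_args(argv)
    sc = make_context()
    try:
        # make_job is assumed to behave like Sparker(sc) in the question.
        make_job(sc).collapse(directory)
    finally:
        sc.stop()  # release the context even if collapse() raises
```

At the bottom of the script, run(sys.argv[1:], spark_context, Sparker) would go under the usual if __name__ == "__main__": guard, as in the question. Passing the two factories in keeps the argument handling testable without a Spark installation.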
avrsanjay answered Oct 01 '22 05:10



The pyspark command starts a REPL (read–eval–print loop): an interactive shell for trying out a few PySpark commands, typically during development. It is specific to Python.

To run a Spark application written in Scala or Python, either on a cluster or locally, use spark-submit.
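
One difference between the two ways of launching that can be checked without Spark is the module name the file runs under. A file passed to spark-submit is executed as __main__, while import my_script in the shell loads it as my_script, so the __main__ guard is skipped and classes report a different __module__. The sketch below reproduces that with the standard library's runpy. The stand-in source and the helper name load are hypothetical, and whether this difference explains the RuntimeError in the question is an assumption that the thread does not confirm.

```python
import os
import runpy
import tempfile

# Stand-in for my_script.py: a class plus the usual __main__ guard.
SOURCE = '''
class Sparker(object):
    pass

ran_main = False
if __name__ == "__main__":
    ran_main = True
'''


def load(run_name):
    """Execute SOURCE under `run_name` and report what the file observed."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "my_script.py")
        with open(path, "w") as handle:
            handle.write(SOURCE)
        namespace = runpy.run_path(path, run_name=run_name)
    return (
        namespace["__name__"],
        namespace["ran_main"],
        namespace["Sparker"].__module__,
    )


as_script = load("__main__")   # how a file given to spark-submit is executed
as_import = load("my_script")  # roughly what `import my_script` does
```

Comparing as_script with as_import shows that the guard block runs only in the first case and that Sparker belongs to __main__ there but to my_script when imported.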

Sharhabeel Hamdan answered Oct 01 '22 05:10
