 

How does Spark interoperate with CPython

I have an Akka system written in scala that needs to call out to some Python code, relying on Pandas and Numpy, so I can't just use Jython. I noticed that Spark uses CPython on its worker nodes, so I'm curious how it executes Python code and whether that code exists in some re-usable form.

asked Jun 06 '15 by Arne Claassen

People also ask

Does PySpark use Jython?

PySpark applications are executed using a standard CPython interpreter in order to support Python modules that use C extensions. We have not tested PySpark with Python 3 or with alternative Python interpreters, such as PyPy or Jython.

How does Spark run Python code?

Spark comes with an interactive Python shell. The PySpark shell is responsible for linking the Python API to the Spark core and initializing the Spark context. The bin/pyspark command launches the Python interpreter to run a PySpark application. PySpark can be launched directly from the command line for interactive use.

How does Python connect to Spark?

Standalone PySpark applications should be run using the bin/pyspark script, which automatically configures the Java and Python environment using the settings in conf/spark-env.sh or .cmd. The script automatically adds the bin/pyspark package to the PYTHONPATH.


1 Answer

The PySpark architecture is described at https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals.

[Diagram: PySpark internals]

As @Holden said, Spark uses Py4J to access Java objects in the JVM from Python. But this is only one case - when the driver program is written in Python (the left part of the diagram).
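For that driver-side case, the Python process talks to a Py4J GatewayServer running inside the JVM over a local socket. Here is a minimal sketch of the JVM side, assuming the py4j jar is on the classpath; the Calculator entry point is made up for illustration, only GatewayServer is the real Py4J API:

    import py4j.GatewayServer

    // Hypothetical entry point whose methods a Python driver could call through Py4J.
    class Calculator {
      def add(a: Int, b: Int): Int = a + b
    }

    object GatewayApp {
      def main(args: Array[String]): Unit = {
        // Expose the Calculator instance to Python; by default the server listens on port 25333.
        val server = new GatewayServer(new Calculator)
        server.start()
        println("Py4J gateway started; a Python process can now call into this JVM")
      }
    }

On the Python side, py4j's JavaGateway() connects to that port, and the driver can then call gateway.entry_point.add(1, 2) as if it were a local object.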

The other case (the right part of the diagram) is when a Spark worker starts a Python process, sends it serialized Java objects to be processed, and receives the output. The Java objects are serialized into the pickle format so that Python can read them.
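As a rough illustration of that second pattern outside of Spark, here is a hedged sketch of what a JVM-side caller could do: pickle the input with Pyrolite (linked below), start a plain CPython interpreter as a child process, stream the pickled bytes to its stdin, and read a pickled result back from its stdout. The script name worker.py and the 4-byte length prefix are assumptions made for this example, not Spark's exact wire protocol:

    import java.io.{DataInputStream, DataOutputStream}
    import net.razorvine.pickle.{Pickler, Unpickler}

    object PythonBridgeSketch {
      def main(args: Array[String]): Unit = {
        // Pickle the input on the JVM side so CPython (and hence pandas/numpy code) can read it.
        val payload: Array[Byte] = new Pickler().dumps(Array(1.0, 2.0, 3.0))

        // Start a standard CPython interpreter running our own (hypothetical) worker script.
        val process = new ProcessBuilder("python", "worker.py").start()
        val toPython = new DataOutputStream(process.getOutputStream)
        val fromPython = new DataInputStream(process.getInputStream)

        // Simple length-prefixed framing: a 4-byte length, then the pickled bytes.
        toPython.writeInt(payload.length)
        toPython.write(payload)
        toPython.flush()
        toPython.close()

        // Read the pickled answer back and unpickle it into a Java object.
        val resultLength = fromPython.readInt()
        val resultBytes = new Array[Byte](resultLength)
        fromPython.readFully(resultBytes)
        println(s"Result from Python: ${new Unpickler().loads(resultBytes)}")

        process.waitFor()
      }
    }

Spark's real implementation (PythonRDD.scala and worker.py, linked below) is more elaborate - it also ships the serialized Python function to execute, batches records, and reuses worker processes - but the data path is the same idea.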

It looks like the latter case is what you are looking for. Here are some links to Spark's Scala core that could be useful for getting started:

  • The Pyrolite library, which provides a Java interface to Python's pickle protocol - used by Spark to serialize Java objects into pickle format. Such a conversion is required, for example, for accessing the key part of key/value pairs in a PairRDD. (A minimal pickling sketch follows this list.)

  • The Scala code that starts the Python process and communicates with it: api/python/PythonRDD.scala

  • The SerDe utils that do the pickling of the data: api/python/SerDeUtil.scala

  • Python side: python/pyspark/worker.py
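Regarding the Pyrolite item above, here is a minimal, illustrative round trip showing how a JVM data structure can be turned into pickle bytes that any CPython process (or Pyrolite's own Unpickler) can load. The values are arbitrary; only Pickler and Unpickler come from Pyrolite:

    import java.util.{HashMap => JHashMap}
    import net.razorvine.pickle.{Pickler, Unpickler}

    object PickleRoundTrip {
      def main(args: Array[String]): Unit = {
        // A small key/value structure, standing in for the data carried by a PairRDD element.
        val data = new JHashMap[String, Any]()
        data.put("key", "some-key")
        data.put("value", 42)

        // Pickler turns the JVM object into Python pickle bytes...
        val pickled: Array[Byte] = new Pickler().dumps(data)

        // ...which pickle.loads(...) in CPython, or Unpickler here, reads back as a map.
        println(new Unpickler().loads(pickled))  // e.g. {value=42, key=some-key}
      }
    }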

answered Sep 17 '22 by vvladymyrov