 

How does Python interact with the JVM inside Spark?

I am writing Python code to develop some Spark applications. I am really curious how Python interacts with the running JVM, so I started reading the source code of Spark.

I can see that, in the end, all the Spark transformations/actions end up calling certain JVM methods in the following way.

# excerpts from pyspark/context.py -- each call is forwarded to the JVM via self._jvm
self._jvm.java.util.ArrayList(),
self._jvm.PythonAccumulatorParam(host, port))
self._jvm.org.apache.spark.util.Utils.getLocalDir(self._jsc.sc().conf())
self._jvm.org.apache.spark.util.Utils.createTempDir(local_dir, "pyspark") \
            .getAbsolutePath()
...

As a Python programmer, I am really curious about what is going on with this _jvm object. I have skimmed all the source code under pyspark and only found _jvm to be an attribute of the Context class; beyond that, I know nothing about _jvm's attributes or methods.

Can anyone help me understand how PySpark calls translate into JVM operations? Should I read some Scala code to see whether _jvm is defined there?

asked Apr 22 '15 by B.Mr.W.


People also ask

How does Python work with Spark?

Spark comes with an interactive Python shell. The PySpark shell is responsible for linking the Python API to the Spark core and initializing the SparkContext. The bin/pyspark command launches the Python interpreter to run a PySpark application. PySpark can be launched directly from the command line for interactive use.
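
For illustration, a minimal interactive session might look like the sketch below; it assumes a local Spark installation and uses the sc variable that the shell pre-creates.

# Launched with: bin/pyspark
# The shell has already created a SparkContext and bound it to `sc`.

# A trivial job: parallelize a local list and run a couple of actions on it.
rdd = sc.parallelize(range(1, 101))
print(rdd.sum())                                  # 5050
print(rdd.filter(lambda x: x % 2 == 0).count())   # 50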

How does PySpark work with JVM?

PySpark uses Py4J, a framework that facilitates interoperation between the two languages, to exchange data between the Python and JVM processes. When you launch a PySpark job, it starts as a Python process, which then spawns a JVM instance and runs some PySpark-specific code in it.
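
The same mechanism can be seen with Py4J on its own, outside Spark. The sketch below assumes py4j is installed (pip install py4j) and a java executable is on the PATH; it is only meant to show the shape of the Python-to-JVM bridge.

from py4j.java_gateway import JavaGateway, GatewayParameters, launch_gateway

# Start a JVM running Py4J's GatewayServer and get the port it listens on.
port = launch_gateway(die_on_exit=True)
gateway = JavaGateway(gateway_parameters=GatewayParameters(port=port))

# `gateway.jvm` is a dynamic view of the JVM: attribute access becomes class
# lookup, and method calls are forwarded to the JVM over a local socket.
names = gateway.jvm.java.util.ArrayList()
names.add("hello from Python")
print(names.size())                            # 1
print(gateway.jvm.java.lang.Math.max(3, 7))    # 7

gateway.shutdown()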

Does PySpark run on JVM?

PySpark is built on top of Spark's Java API. Data is processed in Python and cached/shuffled in the JVM: in the Python driver program, SparkContext uses Py4J to launch a JVM and create a JavaSparkContext.
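
In a running PySpark program you can inspect the pieces this describes. The attributes used below (_gateway, _jvm, _jsc) are private implementation details and may differ between Spark versions, so treat this as an inspection sketch rather than a supported API.

from pyspark import SparkContext

sc = SparkContext("local[2]", "jvm-inspection")

print(type(sc._gateway))   # py4j.java_gateway.JavaGateway
print(type(sc._jvm))       # py4j.java_gateway.JVMView
print(sc._jsc)             # the JavaSparkContext living inside the JVM

# Any class on Spark's JVM classpath is reachable through _jvm.
print(sc._jvm.java.lang.System.getProperty("java.version"))

sc.stop()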

How do I connect Spark to Python?

Standalone PySpark applications should be run using the bin/pyspark script, which automatically configures the Java and Python environment using the settings in conf/spark-env.sh or .cmd. The script automatically adds the pyspark package to the PYTHONPATH.
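
A minimal standalone application might look like the sketch below (wordcount.py is a hypothetical file name); on recent Spark versions it would typically be submitted with bin/spark-submit wordcount.py <input-file>.

# wordcount.py - a minimal standalone PySpark application
import sys
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName="WordCount")
    counts = (sc.textFile(sys.argv[1])
                .flatMap(lambda line: line.split())
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))
    for word, count in counts.take(10):
        print(word, count)
    sc.stop()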


1 Answer

It uses Py4J. There is a special protocol to translate Python calls into JVM calls. You can find all of this in the PySpark code; see java_gateway.py.
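
Roughly, java_gateway.py spawns a JVM (via spark-submit), reads back the port of the Py4J GatewayServer running in it, and connects a JavaGateway to that port; the _jvm attribute you saw is that gateway's JVM view. The snippet below is only a simplified sketch of that shape, not the actual PySpark implementation (connect_to_spark_jvm is a made-up name, and the real code also handles authentication and process management).

from py4j.java_gateway import JavaGateway, GatewayParameters, java_import

def connect_to_spark_jvm(port):
    # 1. Connect to the GatewayServer that the spawned JVM reported back.
    gateway = JavaGateway(gateway_parameters=GatewayParameters(port=port))

    # 2. Import commonly used Spark packages into the default JVM view,
    #    which is why e.g. sc._jvm.PythonAccumulatorParam resolves without
    #    a fully qualified name.
    java_import(gateway.jvm, "org.apache.spark.SparkConf")
    java_import(gateway.jvm, "org.apache.spark.api.java.*")
    java_import(gateway.jvm, "org.apache.spark.api.python.*")

    # 3. gateway.jvm is what SparkContext stores as self._jvm: every
    #    attribute access and call on it is forwarded to the JVM.
    return gateway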

answered Oct 19 '22 by artemdevel