 

Apache Spark Python to Scala translation

If I understand correctly, Apache YARN receives the Application Master and Node Manager as JAR files, which are executed as Java processes on the nodes of the YARN cluster. When I write a Spark program in Python, is it somehow compiled into a JAR? If not, how is Spark able to execute Python logic on the YARN cluster nodes?

asked Dec 14 '25 by Rtik88

1 Answer

The PySpark driver program uses Py4J (http://py4j.sourceforge.net/) to launch a JVM and create a SparkContext. Spark RDD operations written in Python are mapped to operations on PythonRDD in the JVM.
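
To make the link concrete, here is a minimal sketch using PySpark's internal `_jvm` attribute (shown purely for illustration; it is not a public API) of how the Python driver holds a Py4J handle into the JVM it launched:

    from pyspark import SparkContext

    # Creating a SparkContext launches a JVM behind the scenes via Py4J.
    sc = SparkContext("local", "py4j-demo")

    # sc._jvm is a Py4J gateway proxy into that JVM: attribute access and
    # method calls are forwarded over a local socket to Java objects.
    print(sc._jvm.java.lang.System.getProperty("java.version"))

    sc.stop()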

On the remote workers, PythonRDD launches sub-processes which run Python. Data and code are passed from the remote worker's JVM to its Python sub-process using pipes.
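
The mechanism can be illustrated with plain Python, with the parent process standing in for the worker's JVM. This is a simplified sketch of the pattern only, not Spark's actual worker protocol (which uses a framed stream, batching, and a reusable worker daemon):

    import pickle
    import subprocess
    import sys

    # Stand-in for the Python worker: receive (function, data) on stdin,
    # apply the function, and write the pickled results back on stdout.
    worker_code = r"""
    import pickle, sys
    func, items = pickle.load(sys.stdin.buffer)
    pickle.dump([func(x) for x in items], sys.stdout.buffer)
    """

    proc = subprocess.Popen(
        [sys.executable, "-c", worker_code],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
    )

    # The "JVM" side: serialize code and data, push them down the pipe,
    # and read the results back.
    out, _ = proc.communicate(pickle.dumps((len, ["spark", "yarn"])))
    print(pickle.loads(out))  # [5, 4]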

Therefore, your YARN nodes must have Python installed for this to work.
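
As a related aside, which interpreter those sub-processes run can be selected with the PYSPARK_PYTHON environment variable. A minimal sketch (the path is illustrative, and in yarn-cluster deployments the variable typically also has to be propagated to the executors, e.g. via spark.executorEnv.PYSPARK_PYTHON):

    import os

    # Must be set before the SparkContext is created; path is illustrative.
    os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3"

    from pyspark import SparkContext

    sc = SparkContext(appName="python-check")
    print(sc.parallelize([1, 2, 3]).map(lambda x: x * x).collect())  # [1, 4, 9]
    sc.stop()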

The Python code is not compiled into a JAR; instead, it is distributed around the cluster by Spark itself. To make this possible, user-defined functions written in Python are pickled using the following code: https://github.com/apache/spark/blob/master/python/pyspark/cloudpickle.py
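
The reason cloudpickle is needed rather than the standard library's pickle is that stdlib pickle serializes functions by reference (module name plus attribute name), which fails for the lambdas and closures users typically pass to RDD operations; cloudpickle serializes the function's code and referenced state by value. A small standalone demonstration (cloudpickle is also available on PyPI; Spark bundles its own copy):

    import pickle
    import cloudpickle

    multiplier = 3
    f = lambda x: x * multiplier  # references state defined in the driver script

    # Stdlib pickle stores functions by name, so a lambda cannot be pickled.
    try:
        pickle.dumps(f)
    except Exception as e:
        print("stdlib pickle failed:", e)

    # cloudpickle captures the code object and referenced state by value...
    blob = cloudpickle.dumps(f)

    # ...and any ordinary unpickler can restore it, e.g. on a remote worker.
    g = pickle.loads(blob)
    print(g(14))  # 42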

Source: https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals

answered Dec 16 '25 by mattinbits


