I have an Akka system written in Scala that needs to call out to some Python code that relies on Pandas and NumPy, so I can't just use Jython. I noticed that Spark uses CPython on its worker nodes, so I'm curious how it executes Python code and whether that code exists in some reusable form.
PySpark applications are executed using a standard CPython interpreter in order to support Python modules that use C extensions. We have not tested PySpark with Python 3 or with alternative Python interpreters, such as PyPy or Jython.
Spark comes with an interactive Python shell. The PySpark shell is responsible for linking the Python API to the Spark core and initializing the Spark context. The bin/pyspark command launches the Python interpreter to run a PySpark application, so PySpark can be used directly from the command line for interactive work.
Standalone PySpark applications should be run using the bin/pyspark script, which automatically configures the Java and Python environment using the settings in conf/spark-env.sh or .cmd. The script also adds the pyspark package to the PYTHONPATH.
The PySpark architecture is described here: https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals
As @Holden said, Spark uses Py4J to access Java objects in the JVM from Python. But this is only one case: when the driver program is written in Python (the left part of the diagram there).
The other case (the right part of the diagram) is when a Spark worker starts a Python process, sends serialized Java objects to the Python program to be processed, and receives the output back. The Java objects are serialized into pickle format, so Python can read them.
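Spark's actual worker protocol lives in PythonRDD.scala and worker.py, but the core idea (a parent process exchanging length-prefixed pickled objects with a spawned Python worker over stdin/stdout) can be sketched in a few lines. This is a minimal, hypothetical reimplementation for illustration, not Spark's real wire format, which adds framing for broadcast variables, accumulators, and so on:

```python
import pickle
import struct
import subprocess
import sys

# Hypothetical worker: reads length-prefixed pickled objects from stdin,
# "processes" them (here: doubles each value), and writes length-prefixed
# pickled results to stdout. Spark's python/pyspark/worker.py follows the
# same loop shape, with a much richer protocol.
WORKER_SCRIPT = r"""
import pickle, struct, sys

def read_obj(stream):
    header = stream.read(4)
    if len(header) < 4:          # EOF: parent closed the pipe
        return None
    (length,) = struct.unpack(">i", header)
    return pickle.loads(stream.read(length))

def write_obj(obj, stream):
    payload = pickle.dumps(obj)
    stream.write(struct.pack(">i", len(payload)))
    stream.write(payload)
    stream.flush()

stdin, stdout = sys.stdin.buffer, sys.stdout.buffer
while True:
    obj = read_obj(stdin)
    if obj is None:
        break
    write_obj(obj * 2, stdout)   # the "user function" applied per record
"""

def run_worker(values):
    """Parent side: stream pickled records to the worker, collect results."""
    proc = subprocess.Popen(
        [sys.executable, "-c", WORKER_SCRIPT],
        stdin=subprocess.PIPE, stdout=subprocess.PIPE,
    )
    for v in values:
        payload = pickle.dumps(v)
        proc.stdin.write(struct.pack(">i", len(payload)))
        proc.stdin.write(payload)
    proc.stdin.close()           # signals EOF so the worker can exit
    results = []
    for _ in values:
        (length,) = struct.unpack(">i", proc.stdout.read(4))
        results.append(pickle.loads(proc.stdout.read(length)))
    proc.wait()
    return results

print(run_worker([1, 2, 3]))  # [2, 4, 6]
```

From an Akka actor, the parent side of this exchange is what you would wrap: spawn the Python process once and stream pickled records through it, rather than forking per message.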
It looks like the latter case is what you are after. Here are some links into Spark's Scala core that could be useful to get you started:
The Pyrolite library, which provides a Java interface to Python's pickle protocol. Spark uses it to serialize Java objects into pickle format; such a conversion is required, for example, to access the key part of the (key, value) pairs of a PairRDD.
Scala code that starts the Python process and interacts with it: api/python/PythonRDD.scala
SerDe utilities that do the pickling: api/python/SerDeUtil.scala
Python side: python/pyspark/worker.py
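To see why pickle makes a convenient interchange format here, note that whatever the JVM side emits via Pyrolite is just standard pickle bytes, which the stock pickle module can read with no knowledge of the original JVM types. A small sketch, using pickle itself to stand in for the JVM/Pyrolite side:

```python
import pickle

# Stand-in for the JVM side: Pyrolite's Pickler emits standard pickle
# bytes, so we simulate it here with Python's own pickle module.
jvm_bytes = pickle.dumps(("some-key", [1, 2, 3]), protocol=2)

# The Python worker decodes a (key, value) pair without any knowledge
# of the JVM classes that produced it.
key, value = pickle.loads(jvm_bytes)
print(key, value)  # some-key [1, 2, 3]
```

This is the same reason Spark can hand PairRDD elements to Python workers: both sides agree only on the pickle format, not on a shared type system.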