Is it possible to pipe a Spark RDD to Python?
I need a Python library to do some calculations on my data, but my main Spark project is written in Scala. Is there a way to mix the two, or to let Python access the same Spark context?
Speed of performance: Scala is faster than Python because it is statically typed. If raw performance is a requirement, Scala is a good bet. Spark itself is written in Scala, which makes writing Spark jobs in Scala the native approach.
This thread has a dated performance comparison. "Regular" Scala code can run 10-20x faster than "regular" Python code, but PySpark isn't executed like regular Python code, so that comparison isn't really relevant. PySpark code is converted to Spark SQL plans and then executed on a JVM cluster.
Spark's shell provides a simple way to learn the API, as well as a powerful tool to analyze data interactively. It is available in either Scala (which runs on the Java VM and is thus a good way to use existing Java libraries) or Python. Start it from the Spark directory with ./bin/spark-shell for Scala or ./bin/pyspark for Python.
You can indeed pipe out to a python script using Scala and Spark and a regular Python script.
test.py
#!/usr/bin/python
import sys

# each RDD element arrives as one line on stdin; each printed line becomes one output element
for line in sys.stdin:
    print("hello " + line.rstrip("\n"))
spark-shell (Scala)
val data = List("john", "paul", "george", "ringo")
val dataRDD = sc.makeRDD(data)
val scriptPath = "./test.py"
val pipeRDD = dataRDD.pipe(scriptPath)
pipeRDD.foreach(println)
Output
hello john
hello ringo
hello george
hello paul
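Since pipe() returns an ordinary RDD[String], you are not limited to printing: the results can flow back into the rest of your Scala program. A minimal sketch continuing from the example above (the follow-up transformations are only illustrative, assuming the script emits one line per input element):

// pipeRDD is a normal RDD[String], so it can be transformed further in Scala
val upper = pipeRDD.map(_.toUpperCase)            // keep working on it distributed
val results: Array[String] = pipeRDD.collect()    // or bring the strings back to the driver
upper.take(4).foreach(println)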
You can run Python code via pipe() in Spark.
With pipe(), you can write a transformation of an RDD that reads each element from standard input as a String, manipulates that String according to the script, and then writes the result back to standard output as a String.
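Because the contract is plain line-oriented text, any executable that reads stdin and writes stdout can stand in for the Python script. A minimal sketch, assuming the standard Unix rev utility (which reverses each input line) is available on the workers:

// Each element is written to the command's stdin as one line;
// every line the command prints becomes one element of the resulting RDD[String].
val words = sc.parallelize(Seq("spark", "pipe"))
val reversed = words.pipe("rev")
reversed.collect().foreach(println)   // kraps, epip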
With SparkContext.addFile(path), we can add a list of files for each worker node to download when a Spark job starts. Every worker node then has its own copy of the script, so the pipe operation runs in parallel. Any libraries and dependencies the script needs must be installed beforehand on all worker and executor nodes.
Example :
Python file: converts the input data to uppercase
#!/usr/bin/python
import sys

# uppercase each line read from stdin and write it back to stdout
for line in sys.stdin:
    print(line.rstrip("\n").upper())
Spark code: piping the data
import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

val conf = new SparkConf().setAppName("Pipe")
val sc = new SparkContext(conf)
val distScript = "/path/on/driver/PipeScript.py"
val distScriptName = "PipeScript.py"
sc.addFile(distScript)
val ipData = sc.parallelize(List("asd", "xyz", "zxcz", "sdfsfd", "Ssdfd", "Sdfsf"))
val opData = ipData.pipe(SparkFiles.get(distScriptName))
opData.foreach(println)
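If the script needs configuration, pipe() also has an overload that takes a map of environment variables, which are exported to the child process. A hedged sketch building on the example above (PREFIX is just an illustrative variable name the Python script could read via os.environ):

// pass environment variables to the piped script alongside the command
val withEnv = ipData.pipe(SparkFiles.get(distScriptName), Map("PREFIX" -> "hello"))
withEnv.collect().foreach(println)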