Is it possible to pipe a Spark RDD to Python?
I need a Python library to do some calculations on my data, but my main Spark project is written in Scala. Is there a way to mix the two, or to let Python access the same Spark context?
Speed of performance: Scala is faster than Python because it is statically typed. If raw performance is a requirement, Scala is a good bet. Spark itself is written in Scala, which makes writing Spark jobs in Scala the native approach.
This thread has a dated performance comparison. "Regular" Scala code can run 10-20x faster than "regular" Python code, but PySpark isn't executed like regular Python code, so that comparison isn't really relevant. PySpark code is converted to Spark SQL plans and then executed on a JVM cluster.
Spark's shell provides a simple way to learn the API, as well as a powerful tool to analyze data interactively. It is available in either Scala (which runs on the Java VM and is thus a good way to use existing Java libraries) or Python. Start it from the Spark directory with ./bin/spark-shell for Scala or ./bin/pyspark for Python.
You can indeed pipe out to a python script using Scala and Spark and a regular Python script.
test.py
#!/usr/bin/python
import sys

# each RDD element arrives as one line on stdin; each printed line becomes one output element
for line in sys.stdin:
    print("hello " + line.rstrip("\n"))
spark-shell (Scala)
val data = List("john", "paul", "george", "ringo")
val dataRDD = sc.makeRDD(data)
val scriptPath = "./test.py"
val pipeRDD = dataRDD.pipe(scriptPath)
pipeRDD.foreach(println)
Output
hello john
hello ringo
hello george
hello paul
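Since pipe() returns an ordinary RDD[String], you are not limited to printing: the results can flow back into the rest of your Scala program. A minimal sketch continuing from the example above (the follow-up transformations are only illustrative, assuming the script emits one line per input element):

// pipeRDD is a normal RDD[String], so it can be transformed further in Scala
val upper = pipeRDD.map(_.toUpperCase)            // keep working on it distributed
val results: Array[String] = pipeRDD.collect()    // or bring the strings back to the driver
upper.take(4).foreach(println)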
You can run Python code via pipe() in Spark.
With pipe(), you can write a transformation of an RDD that reads each element from standard input as a String, manipulates that String according to the script, and then writes the result back to standard output as a String.
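Because the contract is plain line-oriented text, any executable that reads stdin and writes stdout can stand in for the Python script. A minimal sketch, assuming the standard Unix rev utility (which reverses each input line) is available on the workers:

// Each element is written to the command's stdin as one line;
// every line the command prints becomes one element of the resulting RDD[String].
val words = sc.parallelize(Seq("spark", "pipe"))
val reversed = words.pipe("rev")
reversed.collect().foreach(println)   // kraps, epip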
With SparkContext.addFile(path), we can add a list of files for each worker node to download when a Spark job starts. Every worker node then has its own copy of the script, so the pipe operation runs in parallel. Any libraries and dependencies the script needs must be installed beforehand on all worker and executor nodes.
Example :
Python file: converts the input data to uppercase
#!/usr/bin/python
import sys

# uppercase each line read from stdin and write it back to stdout
for line in sys.stdin:
    print(line.rstrip("\n").upper())
Spark code: piping the data
import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

val conf = new SparkConf().setAppName("Pipe")
val sc = new SparkContext(conf)
val distScript = "/path/on/driver/PipeScript.py"
val distScriptName = "PipeScript.py"
sc.addFile(distScript)
val ipData = sc.parallelize(List("asd", "xyz", "zxcz", "sdfsfd", "Ssdfd", "Sdfsf"))
val opData = ipData.pipe(SparkFiles.get(distScriptName))
opData.foreach(println)
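If the script needs configuration, pipe() also has an overload that takes a map of environment variables, which are exported to the child process. A hedged sketch building on the example above (PREFIX is just an illustrative variable name the Python script could read via os.environ):

// pass environment variables to the piped script alongside the command
val withEnv = ipData.pipe(SparkFiles.get(distScriptName), Map("PREFIX" -> "hello"))
withEnv.collect().foreach(println)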