 

How to use both Scala and Python in the same Spark project?


Is it possible to pipe a Spark RDD to Python?

I need a Python library to do some calculations on my data, but my main Spark project is based on Scala. Is there a way to mix the two, or to let Python access the same Spark context?

asked Oct 06 '15 by Wilson Liao

People also ask

Is Spark better with Scala or Python?

Scala is faster than Python because it is a statically typed language. If performance is a requirement, Scala is a good bet. Spark itself is written in Scala, so writing Spark jobs in Scala is the native way.

Can we use Python in Scala?

Yes. You can pipe out from Scala Spark code to a regular Python script.

Is Scala faster than PySpark?

This thread has a dated performance comparison. "Regular" Scala code can run 10-20x faster than "regular" Python code, but PySpark isn't executed like regular Python code, so that comparison isn't relevant. PySpark code is converted to Spark SQL and then executed on a JVM cluster.

Can I use Scala in PySpark?

Spark's shell provides a simple way to learn the API, as well as a powerful tool to analyze data interactively. It is available in either Scala (which runs on the Java VM and is thus a good way to use existing Java libraries) or Python. Start the Scala shell by running ./bin/spark-shell, or the Python shell by running ./bin/pyspark, from the Spark directory.
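As a rough illustration (my own addition, not from the original page), a minimal interactive session in the Scala shell might look like this, using the SparkContext that spark-shell creates and binds to sc:

// Started from the Spark directory with ./bin/spark-shell,
// which provides a ready-made SparkContext bound to `sc`.
val nums = sc.parallelize(1 to 5)          // distribute a small collection
val doubled = nums.map(_ * 2)              // transform it as an RDD
println(doubled.collect().mkString(", "))  // prints: 2, 4, 6, 8, 10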


2 Answers

You can indeed pipe out from Scala and Spark to a regular Python script.

test.py

#!/usr/bin/python
import sys

# Echo each element received on stdin back with a greeting
for line in sys.stdin:
    print("hello " + line)

spark-shell (Scala)

val data = List("john", "paul", "george", "ringo")
val dataRDD = sc.makeRDD(data)
val scriptPath = "./test.py"
val pipeRDD = dataRDD.pipe(scriptPath)
pipeRDD.foreach(println)

Output

hello john

hello ringo

hello george

hello paul
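If the script is not marked executable on the workers, one variation (an assumption on my part, not something this answer shows) is to invoke the interpreter explicitly using the Seq overload of pipe:

// Assumes test.py may lack the executable bit on the workers, so we call
// the python interpreter explicitly instead of relying on the shebang line.
val pipeRDD2 = dataRDD.pipe(Seq("python", scriptPath))
pipeRDD2.foreach(println)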

answered Oct 07 '22 by Stephen De Gennaro


You can run Python code via pipe() in Spark.

With pipe(), you can write a transformation of an RDD in which an external script reads each RDD element from standard input as a String, manipulates it according to the script's logic, and writes the result back to standard output as a String.

With SparkContext.addFile(path), we can add a list of files for each worker node to download when the Spark job starts. Every worker node then has its own copy of the script, so pipe() runs it in parallel. All required libraries and dependencies must be installed beforehand on every worker and executor node.

Example:

Python file: converts the input data to uppercase

#!/usr/bin/python
import sys

# Uppercase every element received on stdin
for line in sys.stdin:
    print(line.upper())

Spark code: piping the data

import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

val conf = new SparkConf().setAppName("Pipe")
val sc = new SparkContext(conf)

// Ship the Python script from the driver to every worker node
val distScript = "/path/on/driver/PipeScript.py"
val distScriptName = "PipeScript.py"
sc.addFile(distScript)

// Pipe each element through the worker-local copy of the script
val ipData = sc.parallelize(List("asd", "xyz", "zxcz", "sdfsfd", "Ssdfd", "Sdfsf"))
val opData = ipData.pipe(SparkFiles.get(distScriptName))
opData.foreach(println)
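As a small follow-on sketch (my own addition, not part of the answer), the piped result is just an RDD[String], so it can be parsed back into whatever types the rest of the Scala job expects:

// Continues from opData above (assumed): pair each uppercased string with
// its length so downstream Scala code can work with typed values again.
val withLengths = opData.map(s => (s, s.length))
withLengths.collect().foreach { case (word, len) => println(s"$word -> $len") }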
answered Oct 07 '22 by Ajay Gupta