I'm building a RESTful API on top of Apache Spark. Serving the following Python script with spark-submit
seems to work fine:
import cherrypy
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('myApp').getOrCreate()
sc = spark.sparkContext

class doStuff(object):
    @cherrypy.expose
    def compute(self, user_input):
        # do something spark-y with the user input, e.g. a trivial distributed count
        user_output = str(sc.parallelize(user_input.split()).count())
        return user_output

cherrypy.quickstart(doStuff())
But googling around I see things like Livy and spark-jobserver. I've read these projects' documentation and a couple of tutorials, but I still don't fully understand the advantages of Livy or spark-jobserver over a simple script using CherryPy, Flask, or any other web framework. Is it about scalability? Context management? What am I missing here? If all I want is a simple RESTful API without many users, are Livy or spark-jobserver worth the trouble? If so, why?
If you use spark-submit, you must manually upload the JAR file to the cluster and run the command. Everything must be prepared before the run.
If you use Livy or spark-jobserver, you can programmatically upload the file and run the job. You can also add further applications that connect to the same cluster and upload a JAR with the next job.
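For instance, here is a minimal sketch of programmatic batch submission through Livy's REST API, assuming a Livy server at http://localhost:8998 and a hypothetical job script that the cluster can already reach at hdfs:///jobs/my_job.py:

import requests

LIVY_URL = 'http://localhost:8998'  # assumed Livy endpoint

# Submit a batch job; "file" must be reachable by the cluster (HDFS, S3, ...)
resp = requests.post(
    LIVY_URL + '/batches',
    json={'file': 'hdfs:///jobs/my_job.py', 'args': ['some_user_input']},
)
batch = resp.json()
print('submitted batch', batch['id'], 'state:', batch['state'])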
What's more, Livy and spark-jobserver allow you to use Spark in interactive mode, which is hard to do with spark-submit ;)
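As a rough illustration of that interactive mode, the sketch below uses Livy's sessions API (again assuming a Livy server at http://localhost:8998; statement execution is asynchronous, so the client polls until output is available):

import time
import requests

LIVY_URL = 'http://localhost:8998'  # assumed Livy endpoint

# Open an interactive PySpark session
session = requests.post(LIVY_URL + '/sessions', json={'kind': 'pyspark'}).json()
session_url = '{}/sessions/{}'.format(LIVY_URL, session['id'])

# Wait until the session is idle, then run a statement in it
while requests.get(session_url).json()['state'] != 'idle':
    time.sleep(1)
stmt = requests.post(session_url + '/statements',
                     json={'code': 'sc.parallelize(range(100)).count()'}).json()

# Poll the statement until it has produced output
stmt_url = '{}/statements/{}'.format(session_url, stmt['id'])
while requests.get(stmt_url).json()['output'] is None:
    time.sleep(1)
print(requests.get(stmt_url).json()['output'])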
I won't comment on using Livy or spark-jobserver specifically, but there are at least three reasons to avoid embedding the Spark context directly in your application:
Security, with the main focus on reducing the exposure of your cluster to the outside world. An attacker who gains control over your application can do anything from accessing your data to executing arbitrary code on your cluster if the cluster is not correctly configured.
Stability. Spark is a complex framework, and there are many factors which can affect its long-term performance and stability. Decoupling the Spark context from your application lets you handle Spark issues gracefully, without full downtime of your application.
Responsiveness. The user-facing Spark API is mostly (in PySpark, exclusively) synchronous. Using an external service basically solves this problem for you (see the sketch below).
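To make the responsiveness point concrete, here is a rough sketch of a non-blocking web endpoint (hypothetical paths and job script): it hands the work to Livy as a batch and lets the client poll for the job state, so the web process never blocks on a Spark action.

import cherrypy
import requests

LIVY_URL = 'http://localhost:8998'  # assumed Livy endpoint

class DoStuff(object):
    @cherrypy.expose
    def compute(self, user_input):
        # Hand the work to Livy and return immediately with a job id
        batch = requests.post(
            LIVY_URL + '/batches',
            json={'file': 'hdfs:///jobs/my_job.py', 'args': [user_input]},
        ).json()
        return str(batch['id'])

    @cherrypy.expose
    def status(self, batch_id):
        # Clients poll this endpoint; the web app never waits on Spark itself
        state = requests.get('{}/batches/{}/state'.format(LIVY_URL, batch_id)).json()
        return state['state']

cherrypy.quickstart(DoStuff())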