
Why Livy or spark-jobserver instead of a simple web framework?

I'm building a RESTful API on top of Apache Spark. Serving the following Python script with spark-submit seems to work fine:

import cherrypy
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('myApp').getOrCreate()
sc = spark.sparkContext

class DoStuff(object):
    @cherrypy.expose
    def compute(self, user_input):
        # do something spark-y with the user input
        # (placeholder so the example runs: count the words on the cluster)
        user_output = sc.parallelize(user_input.split()).count()
        return str(user_output)

cherrypy.quickstart(DoStuff())

But googling around I see things like Livy and spark-jobserver. I read these projects' documentation and a couple of tutorials but I still don't fully understand the advantages of Livy or spark-jobserver over a simple script with CherryPy or Flask or any other web framework. Is it about scalability? Context management? What am I missing here? If what I want is a simple RESTful API with not many users, are Livy or spark-jobserver worth the trouble? If so, why?

asked Jan 11 '17 by Parzival

2 Answers

If you use spark-submit, you must manually upload the JAR file to the cluster and run the command. Everything must be prepared before the run.

If you use Livy or spark-jobserver, then you can programmatically upload the file and run the job. You can add additional applications that connect to the same cluster and upload a JAR with the next job.
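
For instance, a client could hand a batch job to Livy over plain HTTP instead of shelling out to spark-submit. A minimal sketch with the Python requests library, assuming Livy on its default port 8998; the JAR path and class name are hypothetical placeholders:

import json
import requests

LIVY_URL = 'http://localhost:8998'  # Livy's default port; adjust for your cluster
HEADERS = {'Content-Type': 'application/json'}

# Ask Livy to run spark-submit for us; the JAR just has to be
# reachable from the cluster (e.g. already sitting on HDFS)
payload = {
    'file': 'hdfs:///jobs/my-spark-job.jar',   # hypothetical path
    'className': 'com.example.MyJob',          # hypothetical main class
    'args': ['some-argument'],
}
r = requests.post(LIVY_URL + '/batches', data=json.dumps(payload), headers=HEADERS)
batch = r.json()
print(batch['id'], batch['state'])  # poll GET /batches/<id> to track progress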

What's more, Livy and Spark-JobServer allow you to use Spark in interactive mode, which is hard to do with spark-submit ;)
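
Interactive mode here means pushing individual statements into a long-running session and reading results back, all over REST. A rough sketch against Livy's documented /sessions API, again assuming a Livy server on localhost:8998:

import json
import time
import requests

LIVY_URL = 'http://localhost:8998'  # assumed local Livy server
HEADERS = {'Content-Type': 'application/json'}

# Start a long-running interactive PySpark session
r = requests.post(LIVY_URL + '/sessions',
                  data=json.dumps({'kind': 'pyspark'}), headers=HEADERS)
session_url = LIVY_URL + r.headers['Location']

# Wait until the session is idle, i.e. ready to accept statements
while requests.get(session_url, headers=HEADERS).json()['state'] != 'idle':
    time.sleep(1)

# Run a statement on the cluster and poll until its result is available
r = requests.post(session_url + '/statements',
                  data=json.dumps({'code': 'sc.parallelize(range(100)).count()'}),
                  headers=HEADERS)
statement_url = LIVY_URL + r.headers['Location']
while True:
    result = requests.get(statement_url, headers=HEADERS).json()
    if result['state'] == 'available':
        print(result['output'])  # contains the evaluated value
        break
    time.sleep(1)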

answered Sep 23 '22 by T. Gawęda


I won't comment on using Livy or spark-jobserver specifically, but there are at least three reasons to avoid embedding a Spark context directly in your application:

  • Security, with the main focus on reducing exposure of your cluster to the outside world. An attacker who gains control over your application can do anything from getting access to your data to executing arbitrary code on your cluster, if the cluster is not correctly configured.

  • Stability. Spark is a complex framework and there are many factors which can affect its long-term performance and stability. Decoupling the Spark context from the application allows you to handle Spark issues gracefully, without full downtime of your application.

  • Responsiveness. The user-facing Spark API is mostly (in PySpark exclusively) synchronous, so a long-running job blocks the calling thread. Using an external service basically solves this problem for you; see the sketch after this list.
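
To make the responsiveness point concrete, here is one possible shape for a non-blocking endpoint: the CherryPy handler forwards the work to an already-running Livy session and immediately returns a statement id for the client to poll, so no web thread ever waits on Spark. The session id and the word-count code are hypothetical placeholders:

import json
import cherrypy
import requests

LIVY_URL = 'http://localhost:8998'  # assumed Livy endpoint
HEADERS = {'Content-Type': 'application/json'}
SESSION = '/sessions/0'  # hypothetical, already-running interactive session

class AsyncStuff(object):
    @cherrypy.expose
    def submit(self, user_input):
        # Hand the work to Livy and return a handle right away.
        # NB: interpolating user input into code is for illustration only;
        # never do this with untrusted input.
        code = 'sc.parallelize(%r.split()).count()' % user_input
        r = requests.post(LIVY_URL + SESSION + '/statements',
                          data=json.dumps({'code': code}), headers=HEADERS)
        return str(r.json()['id'])

    @cherrypy.expose
    def result(self, statement_id):
        # The client polls here until 'state' becomes 'available'
        r = requests.get(LIVY_URL + SESSION + '/statements/' + statement_id,
                         headers=HEADERS)
        return json.dumps(r.json())

cherrypy.quickstart(AsyncStuff())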

answered Sep 24 '22 by zero323