
Using Apache Spark as a backend for web application [closed]

We have terabytes of data stored in HDFS, comprising customer data and behavioral information. Business analysts want to slice and dice this data using filters.

These filters are similar to Spark RDD filters. Some examples of these filters are: age > 18 and age < 35, date between 10-02-2015 and 20-02-2015, gender = male, country in (UK, US, India), etc. We want to integrate this filter functionality into our JSF (or Play) based web application.
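
For illustration, here is a minimal sketch of how such predicates could be expressed as Spark RDD filters in Scala. The Customer case class, the field layout, and the HDFS path are illustrative assumptions, not part of the original question:

    import org.apache.spark.{SparkConf, SparkContext}

    // Hypothetical record layout; adjust to the real schema.
    case class Customer(age: Int, gender: String, country: String, date: String)

    object FilterSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("filter-sketch"))

        // Assumes CSV-like files: age,gender,country,date (placeholder path).
        val customers = sc.textFile("hdfs:///data/customers/*")
          .map(_.split(","))
          .map(f => Customer(f(0).toInt, f(1), f(2), f(3)))

        // The analysts' predicates become chained RDD filters.
        val filtered = customers
          .filter(c => c.age > 18 && c.age < 35)
          .filter(c => c.gender == "male")
          .filter(c => Set("UK", "US", "India").contains(c.country))

        println(s"matching customers: ${filtered.count()}")
        sc.stop()
      }
    }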

Analysts would like to experiment by applying/removing filters, and verifying if the count of the final filtered data is as desired. This is a repeated exercise, and the maximum number of people using this web application could be around 100.

We are planning to use Scala as the programming language for implementing the filters. The web application would initialize a single SparkContext when the server starts, and every filter would reuse that same SparkContext.
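
A rough sketch of that shared-context idea, assuming a lazily created application-wide singleton (the app name and master URL below are placeholders, not from the original question):

    import org.apache.spark.{SparkConf, SparkContext}

    // Lazily created, application-wide SparkContext that request handlers share.
    object SharedSpark {
      lazy val sc: SparkContext = {
        val conf = new SparkConf()
          .setAppName("analyst-filters")   // placeholder app name
          .setMaster("yarn-client")        // assumed deployment mode
        new SparkContext(conf)
      }
    }

    // Any request handler can then run a filtered count against the same context, e.g.:
    //   val count = SharedSpark.sc.textFile("hdfs:///data/customers/*").count()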

Is Spark a good fit for this use case of interactive querying through a web application? Also, is sharing a single SparkContext a workaround, or is it the intended approach? The other alternative we have is Apache Hive with the Tez engine, using the ORC compressed file format and querying over JDBC/Thrift. Is that option better than Spark for this job?

asked Mar 26 '15 by Raju Rama Krishna


4 Answers

Apache Livy enables programmatic, fault-tolerant, multi-tenant submission of Spark jobs from web/mobile apps (no Spark client needed). So, multiple users can interact with your Spark cluster concurrently.
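
For example, a web backend could submit filter code to Livy over its REST API. The sketch below is not from the original answer; the Livy host, port 8998, and the pre-created interactive session with id 0 are assumptions:

    import java.net.{HttpURLConnection, URL}
    import java.nio.charset.StandardCharsets

    // Minimal sketch of calling Livy's REST API from a web backend.
    object LivySketch {
      def postJson(url: String, json: String): String = {
        val conn = new URL(url).openConnection().asInstanceOf[HttpURLConnection]
        conn.setRequestMethod("POST")
        conn.setRequestProperty("Content-Type", "application/json")
        conn.setDoOutput(true)
        val out = conn.getOutputStream
        out.write(json.getBytes(StandardCharsets.UTF_8))
        out.close()
        scala.io.Source.fromInputStream(conn.getInputStream).mkString
      }

      def main(args: Array[String]): Unit = {
        // Run a simple filtered count inside the shared session; keeping quotes out
        // of the snippet keeps the JSON escaping trivial for this sketch.
        val snippet = "sc.parallelize(1 to 100).filter(_ > 18).count()"
        val response = postJson(
          "http://livy-host:8998/sessions/0/statements",  // hypothetical endpoint
          s"""{"code": "$snippet"}""")
        println(response)
      }
    }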

answered Oct 23 '22 by leo9r


I'd like to know which solution you chose in the end.

I have two suggestions:

  1. Following the Zeppelin idea from @quickinsights, there is also the interactive notebook Jupyter, which is well established now. It was primarily designed for Python, but specialized kernels can be installed. I tried using Toree a couple of months ago. The basic installation is simple:

    pip install jupyter

    pip install toree

    jupyter toree install

    but at the time I had to do a couple of low-level tweaks to make it work (such as editing /usr/local/share/jupyter/kernels/toree/kernel.json). In the end it worked, and I could use a Spark cluster from a Scala notebook. Check this tutorial; it matches what I remember doing.

  2. Most (all?) docs on Spark talk about running apps with spark-submit, or about interactive usage via spark-shell (sorry, but the Spark/Scala shell is rather disappointing...). They never talk about using Spark inside an interactive application, such as a web app. It is possible (I tried), but there are indeed some issues to check, such as sharing the SparkContext as you mentioned, and also some issues around managing dependencies. You can check the two getting-started prototypes I made for using Spark in a Spring web app. They are in Java, but I would strongly recommend using Scala. I did not work with this long enough to learn a lot, but I can say that it is possible and that it works well (tried on a 12-node cluster, with the app running on an edge node).

    Just remember that the Spark driver, i.e. where the code with the RDDs runs, should be physically on the same cluster as the Spark nodes: there is a lot of communication between the driver and the workers.

answered Oct 23 '22 by Juh_


Analysts would like to experiment by applying/removing filters, and verifying if the count of the final filtered data is as desired. This is a repeated exercise, and the maximum number of people using this web application could be around 100.

Apache Zeppelin provides a framework for interactively ingesting and visualizing data (via a web application) using Apache Spark as the back end. Here is a video demonstrating the features.

Also, the idea of sharing a single SparkContext, is this a work-around approach?

It looks like that project uses a single SparkContext for low-latency query jobs.

answered Oct 23 '22 by quickinsights


It's not the best use case for Spark, but it is completely possible. The latency can be high though.

You might want to check out Spark Jobserver; it should offer most of the features you need. You can also get an SQL view over your data using Spark's JDBC Thrift server.
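
As a sketch of the Thrift-server route (assumptions: the Thrift server runs at thrift-host:10000, a table named customers is already registered, and the Hive JDBC driver is on the classpath; none of these names come from the original answer), the web application would query it exactly as it would query HiveServer2:

    import java.sql.DriverManager

    // Minimal sketch: querying the Spark SQL Thrift server over JDBC.
    object ThriftJdbcSketch {
      def main(args: Array[String]): Unit = {
        Class.forName("org.apache.hive.jdbc.HiveDriver")
        val conn = DriverManager.getConnection("jdbc:hive2://thrift-host:10000/default", "", "")
        val stmt = conn.createStatement()
        val rs = stmt.executeQuery(
          "SELECT COUNT(*) FROM customers WHERE age > 18 AND age < 35 AND country IN ('UK','US','India')")
        while (rs.next()) println(s"count = ${rs.getLong(1)}")
        rs.close(); stmt.close(); conn.close()
      }
    }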

In general I'd advise using Spark SQL for this; it already handles a lot of the things you might be interested in.
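
For instance, a minimal Spark SQL sketch of the filter-and-count pattern could look like the following. The Parquet path, table name, and column names are assumptions, and the DataFrameReader API used here assumes Spark 1.4 or later:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object SparkSqlSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("sparksql-sketch"))
        val sqlContext = new SQLContext(sc)

        // Register the data set as a temporary table, then express the analysts'
        // filters as plain SQL (placeholder path and columns).
        sqlContext.read.parquet("hdfs:///data/customers").registerTempTable("customers")

        val count = sqlContext.sql(
          """SELECT COUNT(*) FROM customers
            |WHERE age > 18 AND age < 35 AND gender = 'male'""".stripMargin)
          .collect()(0).getLong(0)

        println(s"matching customers: $count")
        sc.stop()
      }
    }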

Another option would be to use Databricks Cloud, but it's not publicly available yet.

answered Oct 23 '22 by Marius Soutier