
Is Apache Zeppelin stable enough to be used in production?

I am using an AWS EMR cluster. I have been experimenting with Spark drivers and the Apache Zeppelin REST API to run jobs. I have run several hundred ad hoc jobs with Zeppelin without any problems. Given that, I am considering using the Zeppelin REST API in production and submitting jobs through it.

Has anyone experienced stability issues with Zeppelin in Production?

Dinesh Maheshwari asked Feb 05 '23 21:02


1 Answer

I run Zeppelin in production in a multi-user environment (roughly 15 users) and it hasn't been very stable. To make it more stable, I now run Zeppelin on its own node instead of on the master node.

Anyway, I found the following problems:

  • In releases before 0.7.2, Zeppelin created a lot of zombie processes, which caused memory problems after heavy usage.
  • User libraries can break Zeppelin; this was the case in versions prior to 0.7.0. For example, Jackson libraries made Zeppelin unable to communicate with the Spark interpreter. In 0.7.0 and up this problem has been mitigated.
  • There are random freezes when there are a lot of users. The only way to fix this is to restart the service. (All versions)
  • Sometimes, when a user starts their interpreter and the local repo is empty, Zeppelin doesn't download all the libraries specified in the interpreter config, and it won't try to download them again. The only workaround is to delete the contents of the interpreter's local repo. (All versions)
  • Sometimes changes to notebooks don't get saved, which causes users to lose code.
  • In version 0.6.0, Spark interpreters shared a context, which caused users to overwrite each other's variables.
  • Problems are difficult to debug because the logging is not that great yet. Some bugs seem to break the logging, and sometimes running an interpreter in debug mode makes the problem go away.
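The local-repo workaround from the list above can be scripted. This is only a minimal sketch, not an official Zeppelin tool: it empties one interpreter's dependency directory so that the next interpreter start has to download everything again. The function name `clear_local_repo` is made up for illustration, and the directory layout (something like `<ZEPPELIN_HOME>/local-repo/<interpreter setting id>`) is an assumption; check the `zeppelin.interpreter.localRepo` setting of your installation first.

```python
import shutil
from pathlib import Path


def clear_local_repo(repo_dir: str) -> int:
    """Delete everything inside repo_dir but keep the directory itself.

    repo_dir is assumed to be one interpreter's local repo directory.
    Returns the number of top-level entries that were removed.
    """
    root = Path(repo_dir)
    if not root.is_dir():
        # Nothing to clean up if the directory does not exist.
        return 0

    removed = 0
    for entry in root.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            # Real subdirectory: remove it recursively.
            shutil.rmtree(entry)
        else:
            # Regular file or symlink: remove just the entry.
            entry.unlink()
        removed += 1
    return removed
```

Restart the affected interpreter after running it, so that Zeppelin resolves the dependencies from scratch.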

So I wouldn't put it in a production setting where people depend on it, at least not yet. For testing and data discovery it would be fine. Zeppelin is clearly still in a beta stage.

Also, don't run it on the master node; set up your own instance and let it connect remotely to the cluster. This makes it much more stable. Put it on a beefy node and restart it overnight.
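Besides the nightly restart, a small watchdog can catch the random freezes mentioned earlier. The sketch below is one possible approach, not something Zeppelin ships: it probes the server's `GET /api/version` REST endpoint and runs a restart command only when several probes in a row fail. The function names, the retry count, and the restart command are all assumptions; the right command depends on how Zeppelin is installed (for a tarball install it might be `bin/zeppelin-daemon.sh restart`, on EMR it will be a service command).

```python
import http.client
import subprocess
import urllib.request
from typing import Callable, Sequence


def zeppelin_is_up(base_url: str, timeout: float = 10.0) -> bool:
    """Return True if GET <base_url>/api/version answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/version", timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeouts and HTTP errors all count as "down".
        return False


def restart_if_down(
    probe: Callable[[], bool],
    restart_cmd: Sequence[str],
    retries: int = 3,
) -> bool:
    """Run restart_cmd if all `retries` probes fail.

    Returns True when a restart was triggered, False when a probe succeeded.
    """
    for _ in range(retries):
        if probe():
            return False
    # Every probe failed: assume the server is frozen and restart it.
    subprocess.run(list(restart_cmd), check=True)
    return True
```

Run from cron every few minutes, it could be called as `restart_if_down(lambda: zeppelin_is_up("http://localhost:8080"), ["bin/zeppelin-daemon.sh", "restart"])`, with the URL and command adjusted to your setup. A frozen server may still accept connections, so keep the probe timeout short.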

Most of the bugs I encountered are already in the Jira, and the developers are working hard to make things better. Stability improves with every release and I see the maintenance load going down with every version, so it certainly has potential.

Edgar Klerks answered Mar 31 '23 22:03