Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

High Concurrency Clusters in Databricks

This from Databricks docs:

High Concurrency clusters

A High Concurrency cluster is a managed cloud resource. The key benefits of High Concurrency clusters are that they provide Apache Spark-native fine-grained sharing for maximum resource utilization and minimum query latencies.

High Concurrency clusters work only for SQL, Python, and R. The performance and security of High Concurrency clusters is provided by running user code in separate processes, which is not possible in Scala.

  • Why is this not possible for Scala? Prime aspect of this question.

  • Apache Spark-native fine-grained sharing for maximum resource utilization and minimum query latencies. I have a good idea how Spark works, but may be a clarification here in case there is some extra consideration.

  • As an aside then, are most then all running their own standard clusters per person or using HAC's using SQL etc. I think HAC's would be the way to go - in general.

like image 413
thebluephantom Avatar asked Jan 24 '21 10:01

thebluephantom


1 Answers

High Concurrency clusters are intended for use by multiple users, sharing the cluster resources, and isolating every notebook, etc. For all mentioned languages, separate processes are created and they execute the code that is limited to API exposed by Spark to that language. But Scala code will be executed inside the Spark JVM (per machine) that is shared between all users, so you can get access to everything that is inside JVM.

Typically, when using High Concurrency clusters, the Table ACLs are also enabled (only for SQL or SQL+Python), so you can control who can access which data, enforce row/column level access control, etc.

like image 97
Alex Ott Avatar answered Sep 18 '22 09:09

Alex Ott