Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

what is the cluster manager used in Databricks ? How do I change the number of executors in Databricks clusters?

Tags:

What is the cluster manager used in Databricks? How do I change the number of executors in Databricks clusters ?

like image 666
prady Avatar asked Jul 15 '19 18:07

prady


2 Answers

What is the cluster manager used in Databricks?

Azure Databricks builds on the capabilities of Spark by providing a zero-management cloud platform that includes:

  • Fully managed Spark clusters
  • An interactive workspace for exploration and visualization
  • A platform for powering your favorite Spark-based applications

The Databricks Runtime is built on top of Apache Spark and is natively built for the Azure cloud.

With the Serverless option, Azure Databricks completely abstracts out the infrastructure complexity and the need for specialized expertise to set up and configure your data infrastructure. The Serverless option helps data scientists iterate quickly as a team.

For data engineers, who care about the performance of production jobs, Azure Databricks provides a Spark engine that is faster and performant through various optimizations at the I/O layer and processing layer (Databricks I/O).

How do I change the number of executors in Databricks clusters ?

When you create a cluster, you can either provide a fixed number of workers for the cluster or provide a minimum and maximum number of workers for the cluster.

When you provide a fixed size cluster: Azure Databricks ensures that your cluster has the specified number of workers. When you provide a range for the number of workers, Databricks chooses the appropriate number of workers required to run your job. This is referred to as autoscaling.

With autoscaling: Azure Databricks dynamically reallocates workers to account for the characteristics of your job. Certain parts of your pipeline may be more computationally demanding than others, and Databricks automatically adds additional workers during these phases of your job (and removes them when they’re no longer needed).

Autoscaling makes it easier to achieve high cluster utilization, because you don’t need to provision the cluster to match a workload. This applies especially to workloads whose requirements change over time (like exploring a dataset during the course of a day), but it can also apply to a one-time shorter workload whose provisioning requirements are unknown. Autoscaling thus offers two advantages:

  • Workloads can run faster compared to a constant-sized under-provisioned cluster.
  • Autoscaling clusters can reduce overall costs compared to a statically-sized cluster.

Note: Depending on the constant size of the cluster and the workload, autoscaling gives you one or both of these benefits at the same time. The cluster size can go below the minimum number of workers selected when the cloud provider terminates instances. In this case, Azure Databricks continuously retries to re-provision instances in order to maintain the minimum number of workers.

Cluster autoscaling is not available for spark-submit jobs. To learn more about autoscaling, see Cluster autoscaling.

Hope this helps.

like image 161
CHEEKATLAPRADEEP-MSFT Avatar answered Nov 15 '22 04:11

CHEEKATLAPRADEEP-MSFT


To answer the question:

What is the cluster manager used in Databricks?

I try to dig this information out, but I couldn't find any info about it from the official docs.

It seems like Databricks is not using any of the cluster managers from Spark mentioned here

According to this presentation, On page 23, it mentions 3 parts of Databricks cluster manager

  • Instance manager
  • Resource manager
  • Spark Cluster manager

So I guess Databricks uses its own pripriotory cluster manager.

like image 40
fuyi Avatar answered Nov 15 '22 06:11

fuyi