I am currently optimizing our ETL process and would like to see the existing cluster configuration used when processing data, so that I can track over time which worker node sizes I should use.
Is there a command that returns the number of workers and their node sizes in Python, so I can write the result as a DataFrame?
You can get this information by calling the Clusters Get REST API (GET /api/2.0/clusters/get); it returns JSON that includes the number of workers, the node types, etc. Something like this:
import requests

# Pull the workspace host, API token, and cluster ID from the notebook context
ctx = dbutils.notebook.entry_point.getDbutils().notebook().getContext()
host_name = ctx.tags().get("browserHostName").get()
host_token = ctx.apiToken().get()
cluster_id = ctx.tags().get("clusterId").get()

# Call the Clusters Get API for the cluster this notebook is attached to
response = requests.get(
    f'https://{host_name}/api/2.0/clusters/get?cluster_id={cluster_id}',
    headers={'Authorization': f'Bearer {host_token}'}
).json()

# 'num_workers' is present for fixed-size clusters; autoscaling clusters
# return an 'autoscale' object with min_workers/max_workers instead
num_workers = response['num_workers']
node_type = response['node_type_id']
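To get this into a DataFrame as you asked, you can flatten the fields you care about from the JSON response. A minimal sketch, using an explicit schema so optional fields don't break type inference (the selected fields follow the Clusters API response; `spark` is the notebook's session):

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Explicit schema so optional fields (e.g. num_workers on an autoscaling
# cluster) can be null without breaking schema inference
schema = StructType([
    StructField('cluster_id', StringType()),
    StructField('num_workers', IntegerType()),
    StructField('node_type_id', StringType()),
    StructField('driver_node_type_id', StringType()),
])

df = spark.createDataFrame([(
    response['cluster_id'],
    response.get('num_workers'),
    response.get('node_type_id'),
    response.get('driver_node_type_id'),
)], schema)
df.show()

If you run this after each ETL job and append the rows to a table, you get the over-time view of worker counts and node sizes you're after.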
P.S. If you run this in a non-notebook job, the API token from the notebook context may not be available; in that case, generate a personal access token (PAT) and use it there instead.
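A minimal sketch of that case, assuming you've stored the PAT in a Databricks secret scope (the scope and key names here, and the workspace URL, are placeholders):

# In a non-notebook job the context tags/token may be missing, so set the
# workspace URL explicitly and read a pre-generated PAT from a secret scope
# ('etl-scope' and 'pat-token' are hypothetical names)
host_name = "your-workspace.cloud.databricks.com"
host_token = dbutils.secrets.get(scope="etl-scope", key="pat-token")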