
What is a good Databricks workflow

I'm using Azure Databricks for data processing, with notebooks and pipelines.

I'm not satisfied with my current workflow:

  • The notebook used in production can't be modified without breaking production. When I want to develop an update, I duplicate the notebook, change the source code until I'm satisfied, then replace the production notebook with my new notebook.
  • My browser is not an IDE! I can't easily go to a function definition. I have lots of notebooks, and if I want to modify or even just see the documentation of a function, I need to switch to the notebook where this function is defined.
  • Is there a way to do efficient and systematic testing?
  • Git integration is very simple, but this is not my main concern.
Be Chiller Too asked Nov 12 '19 16:11



1 Answer

Great question. Definitely don't modify your production code in place.

One recommended pattern is to keep separate folders in your workspace for dev, staging, and prod. Do your dev work, then run tests in staging before finally promoting to production.

You can use the Databricks CLI to pull and push a notebook from one folder to another without breaking existing code. Going one step further, you can incorporate this pattern with git to sync with version control. In either case, the CLI gives you programmatic access to the workspace and that should make it easier to update code for production jobs.
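As a rough illustration of that promotion step, here is a minimal Python sketch that builds the two CLI invocations (export from one folder, import into another). It assumes the legacy `databricks-cli` syntax for `databricks workspace export` and `databricks workspace import`; newer CLI versions use different arguments, so check `--help` on your install. The notebook name, folder names, and local directory are hypothetical.

```python
# Sketch only: builds the commands, does not run them.
# Assumes the legacy databricks-cli argument order (SOURCE_PATH TARGET_PATH).

def promotion_commands(notebook, src_folder, dst_folder,
                       local_dir="/tmp/promote", language="PYTHON"):
    """Return [export_cmd, import_cmd] to copy a notebook between workspace folders."""
    local_file = f"{local_dir}/{notebook}.py"  # hypothetical scratch location
    export_cmd = [
        "databricks", "workspace", "export",
        "--overwrite", "--format", "SOURCE",
        f"{src_folder}/{notebook}", local_file,
    ]
    import_cmd = [
        "databricks", "workspace", "import",
        "--overwrite", "--format", "SOURCE", "--language", language,
        local_file, f"{dst_folder}/{notebook}",
    ]
    return [export_cmd, import_cmd]


# Hypothetical notebook and folder names, printed for inspection.
for cmd in promotion_commands("etl_daily", "/Staging", "/Production"):
    print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the CLI is configured. The local file in between is also a natural thing to commit to git.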

Regarding your second point about IDEs: Databricks offers Databricks Connect, which lets you use your IDE while running commands on a cluster. Based on your pain points I think this is a great solution for you, as it will give you more visibility into the functions you have defined and so on. You can also write and run your unit tests this way.
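To make the unit-testing part concrete, one possible approach (my assumption, not something specific to Databricks Connect) is to move logic out of notebooks into plain Python functions and test those from the IDE. The helper `normalize_column_name` below is a hypothetical example of such a function.

```python
import re
import unittest


def normalize_column_name(name: str) -> str:
    """Lower-case a column name and collapse non-alphanumeric runs into '_'."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    return cleaned.strip("_").lower()


class NormalizeColumnNameTest(unittest.TestCase):
    def test_spaces_become_underscores(self):
        self.assertEqual(normalize_column_name("Order ID"), "order_id")

    def test_punctuation_is_dropped(self):
        self.assertEqual(normalize_column_name(" Total (EUR) "), "total_eur")
```

Saved as a module, this runs with `python -m unittest` locally, and the same function can be imported by the code you run on the cluster.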

Once you have your scripts ready to go, you can always import them into the workspace as notebooks and run them as jobs. You can also run .py scripts as a job using the REST API.
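For the REST API route, a minimal sketch of a one-off run might look like the following. It assumes the Jobs API 2.0 `runs/submit` endpoint with a `spark_python_task`; the host, token, cluster ID, and DBFS path are placeholders, and newer API versions structure the payload differently, so treat it as a starting point.

```python
import json
import urllib.request


def build_submit_request(host, token, python_file, cluster_id,
                         run_name="adhoc-script-run", parameters=None):
    """Build (but do not send) a POST request that submits a .py file as a run."""
    payload = {
        "run_name": run_name,
        "existing_cluster_id": cluster_id,
        "spark_python_task": {
            "python_file": python_file,       # e.g. a dbfs:/ path to the script
            "parameters": parameters or [],
        },
    }
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/api/2.0/jobs/runs/submit",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# All values below are hypothetical placeholders.
req = build_submit_request(
    host="https://example.azuredatabricks.net",
    token="<personal-access-token>",
    python_file="dbfs:/scripts/etl_daily.py",
    cluster_id="<cluster-id>",
)
print(req.get_method(), req.full_url)
```

Sending it is then `urllib.request.urlopen(req)` with real credentials; the response should contain a run ID you can poll.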

Raphael K answered Oct 12 '22 11:10
