I have recently begun using Azure Databricks and am comparing it to Jupyter Notebooks running on HDInsight. I have searched around and read the documentation trying to learn how to have ADBricks use VSTS git for source control. However, I have not found a solution that works.
I have found instructions for using other git providers, but I want to be clear that is not an option for this use-case so please refrain from those types of responses.
HDInsight has similar limitations, but I could work around them via ssh/rsync. That was fine because I was deploying to the remote server the same way a build would, and I was able to do blue/green deployments and the like.
For ADBricks, the cluster-on-demand is amazing, but there is an assumption that you're developing in Notebooks "on the cluster" and effectively you're in Continuous Delivery mode. This is just fine with me (except for the less-than-adequate, high-latency notebook development), but I still need to automate getting code to VSTS periodically to save state/backup like a good coder should :).
Typically for full CI/CD in Azure Databricks we use the workspace API to pull and push whole notebooks or directories from Databricks to a user's local machine or a build server. https://docs.azuredatabricks.net/api/latest/workspace.html
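As a rough illustration of the pull side, here is a minimal Python sketch that exports a single notebook through the workspace API's `export` endpoint. The host, token, and paths are placeholders, and the helper names (`build_export_url`, `decode_export_response`, `export_notebook`) are made up for this example; it assumes a personal access token and the 2.0 API, so check the linked docs against your workspace before relying on it.

```python
import base64
import json
import urllib.parse
import urllib.request


def build_export_url(host, notebook_path, fmt="SOURCE"):
    """Build the workspace export URL for one notebook (helper name is hypothetical)."""
    query = urllib.parse.urlencode({"path": notebook_path, "format": fmt})
    return f"{host.rstrip('/')}/api/2.0/workspace/export?{query}"


def decode_export_response(body):
    """The export endpoint returns JSON whose 'content' field is base64-encoded."""
    payload = json.loads(body)
    return base64.b64decode(payload["content"])


def export_notebook(host, token, notebook_path, local_file):
    """Download one notebook's source and write it to local_file."""
    request = urllib.request.Request(
        build_export_url(host, notebook_path),
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request) as response:
        source = decode_export_response(response.read())
    with open(local_file, "wb") as handle:
        handle.write(source)
```

Something like `export_notebook("https://<your-region>.azuredatabricks.net", token, "/Users/me/my_notebook", "my_notebook.py")` would then leave a plain source file that can be committed to any git remote, VSTS included.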
Databricks also has a CLI that leverages the workspace API for easier, higher-level commands: https://docs.azuredatabricks.net/user-guide/dev-tools/databricks-cli.html
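For the periodic backup to VSTS that the question asks about, one option is to wrap the CLI's `workspace export_dir` command and plain git in a small script run on a schedule or from a build agent. The sketch below assumes the CLI is already installed and configured, that the local directory is an existing clone of the VSTS repo, and that the `-o` overwrite flag behaves as in the CLI docs linked above; the function names and paths are hypothetical.

```python
import subprocess


def backup_commands(workspace_dir, local_dir, message):
    """Return the commands for one backup pass (function name is hypothetical)."""
    return [
        # Export the whole workspace directory, overwriting local copies.
        ["databricks", "workspace", "export_dir", "-o", workspace_dir, local_dir],
        # Stage, commit, and push from the local clone of the VSTS repo.
        ["git", "-C", local_dir, "add", "-A"],
        ["git", "-C", local_dir, "commit", "-m", message],
        ["git", "-C", local_dir, "push"],
    ]


def run_backup(workspace_dir, local_dir, message="Notebook backup"):
    """Run each command in order, stopping at the first failure."""
    for command in backup_commands(workspace_dir, local_dir, message):
        # Note: 'git commit' exits non-zero when nothing changed,
        # which raises CalledProcessError here and skips the push.
        subprocess.run(command, check=True)
```

Since the git side only talks to whatever remote the clone already has, this works the same for VSTS as for any other provider.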
The workflow for this looks something like this:
Here is a blog post from Databricks that goes into more detail: https://databricks.com/blog/2017/10/30/continuous-integration-continuous-delivery-databricks.html