I am looking to use Databricks Connect for developing a PySpark pipeline. DBConnect is really awesome because I am able to run my code on the cluster where the actual data resides, so it's perfect for integration testing. But during development and unit testing (pytest with pytest-spark), I also want to be able to simply use a local Spark environment.
Is there any way to configure things so that for one use case I simply use a local Spark environment, but for the other it uses DBConnect?
I was in a similar situation and this is what I did:
import os

# Use Databricks Connect only when MY_ENV_VAR is set to "test";
# fall back to plain local PySpark otherwise.
use_databricks = os.getenv("MY_ENV_VAR", "dev") == "test"

if use_databricks:
    from databricks.connect import DatabricksSession as SparkSession
else:
    from pyspark.sql import SparkSession
The above code checks MY_ENV_VAR and, depending on its value, imports either the databricks-connect library or plain pyspark. Python doesn't complain about a package you don't have installed until the import line actually runs; I assume that is one of the benefits of Python not being a precompiled language.
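To tie this back to pytest, here is a minimal conftest.py sketch of how the conditional import can be wired into a session fixture. The fixture name, the local master and the appName are just my choices (and if you use pytest-spark you can lean on its built-in session fixture instead); on the Databricks side I assume the connection details come from environment variables or a Databricks config profile.

# conftest.py -- minimal sketch, not the only way to do this
import os

import pytest

use_databricks = os.getenv("MY_ENV_VAR", "dev") == "test"

if use_databricks:
    from databricks.connect import DatabricksSession as SparkSession
else:
    from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark():
    """Databricks Connect session in CI, local PySpark session otherwise."""
    if use_databricks:
        # Databricks Connect picks up the workspace URL, token and cluster id
        # from environment variables or a Databricks config profile.
        return SparkSession.builder.getOrCreate()
    # Plain local Spark for fast unit tests.
    return (
        SparkSession.builder
        .master("local[*]")
        .appName("unit-tests")
        .getOrCreate()
    )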
I have also found that if you have both pyspark and databricks-connect installed, the databricks-connect library always overrides plain Spark with its own implementation. Therefore I had to split my dependencies into two requirements files:
# requirements.txt
pyspark==3.5.0
pytest==7.4.3
and
# requirements-test.txt
-r requirements.txt
databricks-connect==14.0.1
When I work locally on my PC I want to run with local Spark, so I install only the main file: pip install -r requirements.txt
When I run in CI/CD I want to use the Databricks cluster through databricks-connect, so there I install with pip install -r requirements-test.txt, which also pulls in everything from the main requirements.txt file.
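If you ever want to double-check which flavour of Spark a given environment ended up with, a quick sanity check like the sketch below works. It is purely optional and not needed for the setup itself; it only probes whether the databricks.connect module is importable.

# Optional sanity check: which Spark setup is this interpreter using?
try:
    import databricks.connect  # noqa: F401
    print("databricks-connect installed -> Spark code targets the remote cluster")
except ImportError:
    print("only pyspark installed -> Spark code runs locally")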