I am looking to use Databricks Connect for developing a PySpark pipeline. DBConnect is really awesome because I am able to run my code on the cluster where the actual data resides, so it's perfect for integration testing. But during development and unit testing (pytest with pytest-spark), I also want to be able to simply use a local Spark environment.
Is there any way to configure things so that for one use case I simply use a local Spark environment, but for the other it uses DBConnect?
I was in a similar situation and this is what I did:
import os

# Use Databricks Connect only when MY_ENV_VAR is set to "test";
# fall back to plain local PySpark otherwise.
use_databricks = os.getenv("MY_ENV_VAR", "dev") == "test"

if use_databricks:
    from databricks.connect import DatabricksSession as SparkSession
else:
    from pyspark.sql import SparkSession
The above code checks MY_ENV_VAR and, depending on its value, imports either the databricks-connect library or plain pyspark. Python doesn't complain about a package you don't have installed until the import line actually runs; I assume that is one of the benefits of Python not being a precompiled language.
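To tie this back to pytest, here is a minimal conftest.py sketch of how the conditional import can be wired into a session fixture. The fixture name, the local master and the appName are just my choices (and if you use pytest-spark you can lean on its built-in session fixture instead); on the Databricks side I assume the connection details come from environment variables or a Databricks config profile.

# conftest.py -- minimal sketch, not the only way to do this
import os

import pytest

use_databricks = os.getenv("MY_ENV_VAR", "dev") == "test"

if use_databricks:
    from databricks.connect import DatabricksSession as SparkSession
else:
    from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark():
    """Databricks Connect session in CI, local PySpark session otherwise."""
    if use_databricks:
        # Databricks Connect picks up the workspace URL, token and cluster id
        # from environment variables or a Databricks config profile.
        return SparkSession.builder.getOrCreate()
    # Plain local Spark for fast unit tests.
    return (
        SparkSession.builder
        .master("local[*]")
        .appName("unit-tests")
        .getOrCreate()
    )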
I have also found that if you have both pyspark and databricks-connect installed, the databricks-connect library always overrides plain Spark with its own implementation. Therefore I had to split my dependencies into two requirements files:
# requirements.txt
pyspark==3.5.0
pytest==7.4.3
and
# requirements-test.txt
-r requirements.txt
databricks-connect==14.0.1
When I work locally on my PC I want to run with local Spark, so I install only the main file: pip install -r requirements.txt
When I run in CI/CD I want to use the Databricks cluster through databricks-connect, so there I install with pip install -r requirements-test.txt, which also pulls in everything from the main requirements.txt file.
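If you ever want to double-check which flavour of Spark a given environment ended up with, a quick sanity check like the sketch below works. It is purely optional and not needed for the setup itself; it only probes whether the databricks.connect module is importable.

# Optional sanity check: which Spark setup is this interpreter using?
try:
    import databricks.connect  # noqa: F401
    print("databricks-connect installed -> Spark code targets the remote cluster")
except ImportError:
    print("only pyspark installed -> Spark code runs locally")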