
PySpark: How can I import a GitHub library into Databricks?

I would like to use this library for anomaly detection in Databricks: iForest. This library cannot be installed through PyPI.

How can I install libraries from GitHub in Databricks? I read about using something called an "egg" but I don't quite understand how it should be used.

DataBach asked Oct 28 '25 02:10


1 Answer

You can clone the repo and build a Python package as explained at https://github.com/titicaca/spark-iforest:

Step 2. Package pyspark-iforest and install it via pip, skip this step if you don't need the python pkg

cd spark-iforest/python

python setup.py sdist

pip install dist/pyspark-iforest-<version>.tar.gz

Here you only need the first two commands to generate the package, but you have to change the second one so that it builds an egg package instead of a source distribution:

python3 setup.py bdist_egg
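If you would rather script the build than type the commands by hand, it could look roughly like the sketch below. The helper names `egg_build_steps` and `run_steps` are made up for illustration; it assumes `git` is on the PATH, that the working directory already exists, and that the installed setuptools still provides the `bdist_egg` command.

```python
import subprocess
import sys
from pathlib import Path

# Repository URL taken from the answer above.
REPO_URL = "https://github.com/titicaca/spark-iforest"


def egg_build_steps(workdir):
    """Return (command, cwd) pairs: clone the repo, then build the egg.

    Hypothetical helper; it only describes the steps and runs nothing.
    """
    workdir = Path(workdir)
    python_dir = workdir / "spark-iforest" / "python"
    return [
        (["git", "clone", REPO_URL], workdir),
        ([sys.executable, "setup.py", "bdist_egg"], python_dir),
    ]


def run_steps(steps):
    """Run each command in its directory and stop at the first failure."""
    for command, cwd in steps:
        subprocess.run(command, cwd=cwd, check=True)
```

Calling `run_steps(egg_build_steps("/tmp/build"))` would clone into `/tmp/build/spark-iforest` and build inside its `python` subfolder, as in the manual steps.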

Now you'll find the generated file in the dist folder:

pyspark_iforest-2.4.0-py3.7.egg
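The exact file name depends on the package version and the Python version used for the build, so yours may differ from the one shown. A small sketch for locating whatever egg was produced, assuming you run it from spark-iforest/python (`find_eggs` is a hypothetical helper name, not part of the library):

```python
from pathlib import Path


def find_eggs(dist_dir="dist"):
    """Return all .egg files directly inside dist_dir, sorted by name."""
    return sorted(Path(dist_dir).glob("*.egg"))


if __name__ == "__main__":
    # Print each egg found so you know which file to upload.
    for egg in find_eggs():
        print(egg)
```

If nothing is printed, the build step probably did not run in this folder.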

Finally, on Databricks, select Create > Library and choose Python Egg to upload the generated file. More details can be found here.

blackbishop answered Oct 31 '25 03:10