How do I get Python libraries in pyspark?

I want to use the matplotlib.bblpath or shapely.geometry libraries in PySpark.

When I try to import either of them, I get the following error:

>>> from shapely.geometry import polygon
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
ImportError: No module named shapely.geometry

I know the module isn't present, but how can these packages be made available to PySpark?

thenakulchawla asked Mar 25 '16

People also ask

Can you use Python libraries in PySpark?

Yes. Since Python 3.3, a subset of virtualenv's features has been integrated into the standard library as the venv module. PySpark users can use virtualenv to manage Python dependencies on their clusters, packaging the environment with venv-pack in much the same way as conda-pack is used with Conda.
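
A hedged sketch of that workflow (it assumes Spark 3.1+ on YARN or Kubernetes, where the spark.archives option is supported; the archive and environment names here are illustrative):

# Build and pack the environment on the driver machine first (shell):
#   python -m venv pyspark_venv
#   source pyspark_venv/bin/activate
#   pip install shapely venv-pack
#   venv-pack -o pyspark_venv.tar.gz
import os
from pyspark.sql import SparkSession

# Executors unpack the archive into ./environment and run its interpreter.
os.environ["PYSPARK_PYTHON"] = "./environment/bin/python"

spark = (
    SparkSession.builder
    .config("spark.archives", "pyspark_venv.tar.gz#environment")
    .getOrCreate()
)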


1 Answer

On the SparkContext, try:

sc.addPyFile("module.py")  # also accepts .zip archives

Quoting from the docs:

Add a .py or .zip dependency for all tasks to be executed on this SparkContext in the future. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI.
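
As a minimal sketch of how a job would then use the shipped dependency (assuming a pure-Python package; mymodule.zip and mymodule.transform are illustrative names, not from the original post):

from pyspark import SparkContext

sc = SparkContext(appName="addPyFile-example")

# Make the archive available to every future task on this SparkContext.
sc.addPyFile("/path/to/mymodule.zip")

def use_dependency(x):
    # Import inside the function so the lookup happens on the executor,
    # after the archive has been shipped there.
    import mymodule
    return mymodule.transform(x)

print(sc.parallelize(range(4)).map(use_dependency).collect())

Note that addPyFile only distributes Python source; a package with native extensions (shapely, for instance, depends on the GEOS C library) may instead need the virtualenv/conda packaging approach described above.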

armatita answered Sep 20 '22