
PySpark returns a "No module named" error for a custom module

Tags:

python

pyspark

I would like to import a .py file that contains some functions. I have saved the files __init__.py and util_func.py under this folder:

/usr/local/lib/python3.4/site-packages/myutil

util_func.py contains all the functions that I would like to use. I also need to create a PySpark UDF so that I can use it to transform my DataFrame. My code looks like this:

import pyspark.sql.functions
from pyspark.sql.types import StringType
import myutil
from myutil import util_func
myudf = pyspark.sql.functions.udf(util_func.ConvString, StringType())

Further down in the code, I use this to convert one of the columns in my DataFrame:

df = df.withColumn("newcol", myudf(df["oldcol"]))

Then I try to check whether the conversion works by using:

df.head()

It fails with an error "No module named myutil".

I am able to call the functions from within IPython, but somehow the PySpark engine does not see the module. Any idea how to make sure that the PySpark engine picks up the module?

Arvind Kandaswamy asked Jul 21 '17 13:07



1 Answer

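You must build an egg file of your package using setuptools and add it to your application so that it is shipped to the executors. Use sc.addPyFile rather than sc.addFile, because addPyFile also puts the archive on the executors' Python path:

sc.addPyFile('<path of the egg file>')

Here sc is the SparkContext variable.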
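
The packaging step can be sketched with the standard library alone. The sketch below zips a package directory so that the package folder sits at the archive root, then checks that Python can import from the archive, which is roughly what an executor does once the archive is on its path. It uses a zip in place of an egg, as a simplification. The demo package, the body of ConvString, and the helper names build_package_zip and make_demo_package are hypothetical stand-ins, since the real util_func.py is not shown in the question.

```python
import importlib
import os
import shutil
import sys
import tempfile


def build_package_zip(package_dir, out_dir):
    """Zip a package so that '<name>/' is the top-level entry of the archive."""
    package_dir = os.path.abspath(package_dir)
    parent, name = os.path.split(package_dir)
    base = os.path.join(out_dir, name)
    # root_dir/base_dir keep the package folder itself inside the archive,
    # so 'import <name>' works when the zip is placed on sys.path.
    return shutil.make_archive(base, "zip", root_dir=parent, base_dir=name)


def make_demo_package(root):
    """Create a stand-in for the 'myutil' package described in the question."""
    pkg = os.path.join(root, "myutil")
    os.makedirs(pkg)
    open(os.path.join(pkg, "__init__.py"), "w").close()
    with open(os.path.join(pkg, "util_func.py"), "w") as f:
        # Placeholder body: the real ConvString is an assumption here.
        f.write("def ConvString(value):\n    return str(value).upper()\n")
    return pkg


work_dir = tempfile.mkdtemp()
zip_path = build_package_zip(
    make_demo_package(os.path.join(work_dir, "src")), work_dir
)

# Simulate what a worker needs: the archive on sys.path, nothing else.
sys.path.insert(0, zip_path)
util_func = importlib.import_module("myutil.util_func")
print(util_func.ConvString("abc"))

# On a real cluster, ship the archive before the UDF is evaluated:
#   sc.addPyFile(zip_path)
# or pass it at launch time:
#   spark-submit --py-files myutil.zip your_job.py
```

If the import succeeds from the archive alone, the same archive is a reasonable candidate to hand to sc.addPyFile or --py-files. Whether the workers also need a matching Python version or extra dependencies depends on the cluster setup.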

rogue-one answered Oct 08 '22 17:10