I would like to import a .py file that contains some functions. I have saved the files __init__.py and util_func.py under this folder:
/usr/local/lib/python3.4/site-packages/myutil
util_func.py contains all the functions that I would like to use. I also need to create a PySpark UDF so I can use it to transform my dataframe. My code looks like this:
import pyspark.sql.functions
from pyspark.sql.types import StringType
import myutil
from myutil import util_func
myudf = pyspark.sql.functions.udf(util_func.ConvString, StringType())
Somewhere further down in the code, I use this to convert one of the columns in my dataframe:
df = df.withColumn("newcol", myudf(df["oldcol"]))
Then I try to check whether the conversion worked by using:
df.head()
It fails with an error "No module named myutil".
I am able to bring up the functions within ipython. Somehow the PySpark engine does not see the module. Any idea how to make sure that the PySpark engine picks up the module?
Python's ImportError (ModuleNotFoundError) indicates that you tried to import a module that Python cannot find. It can usually be resolved by adding a file named __init__.py to the package directory and then adding the directory that contains the package to $PYTHONPATH.
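To see whether a given interpreter can locate the package at all, a small check like the one below may help. It is only a sketch: can_import is a hypothetical helper name, and "myutil" is the package name from the question. Running it in ipython shows what the driver sees, which may differ from what the Spark worker processes see.

```python
import importlib.util
import sys


def can_import(module_name):
    """Return True if this interpreter can locate a top-level module."""
    return importlib.util.find_spec(module_name) is not None


# "myutil" is the package name from the question.
print("myutil importable:", can_import("myutil"))

# Directories searched for imports; $PYTHONPATH entries show up here.
for entry in sys.path:
    print(entry)
```

If the check succeeds locally but the job still fails, the package is probably missing from the import path of the worker processes rather than from the driver.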
PySpark is the Python API for Apache Spark, an open source, distributed computing framework and set of libraries for real-time, large-scale data processing. If you're already familiar with Python and libraries such as Pandas, then PySpark is a good tool to learn for creating more scalable analyses and pipelines.
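You need to build an egg file of your package using setuptools and add the egg file to your application as shown below. Note that addPyFile, rather than addFile, is the call that also puts the file on the Python path of the workers:
sc.addPyFile('<path of the egg file>')
Here sc is the SparkContext variable.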
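As an alternative to an egg, a plain zip archive of the package is also commonly shipped with SparkContext.addPyFile, which accepts .py and .zip dependencies. The sketch below assumes that approach; zip_package is a hypothetical helper, the /tmp output directory is an arbitrary choice, and the Spark calls are left as comments because they need a running SparkContext.

```python
import os
import shutil


def zip_package(package_dir, out_dir):
    """Zip a package so that the package folder is the archive's top-level entry."""
    package_dir = os.path.abspath(package_dir)
    parent, name = os.path.split(package_dir)
    base = os.path.join(out_dir, name)
    # root_dir/base_dir keep "<name>/..." as the top-level path inside the zip,
    # which is what "import <name>" expects when the zip is on sys.path.
    return shutil.make_archive(base, "zip", root_dir=parent, base_dir=name)


# Usage on the driver, assuming an existing SparkContext named sc:
# archive = zip_package("/usr/local/lib/python3.4/site-packages/myutil", "/tmp")
# sc.addPyFile(archive)
# from myutil import util_func
```

The archive has to be added before the UDF runs on the workers, so the addPyFile call belongs near the top of the script, ahead of the first action such as df.head().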