I am saving a DataFrame to a CSV file in PySpark using the statement below:
df_all.repartition(1).write.csv("xyz.csv", header=True, mode='overwrite')
But I am getting the error below:
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 218, in main
func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 138, in read_udfs
arg_offsets, udf = read_single_udf(pickleSer, infile, eval_type)
File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 118, in read_single_udf
f, return_type = read_command(pickleSer, infile)
File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 58, in read_command
command = serializer._read_with_length(file)
File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 170, in _read_with_length
return self.loads(obj)
File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 559, in loads
return pickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'app'
I am using PySpark version 2.3.0, and I am getting this error while trying to write to a file.
import json, jsonschema
from pyspark.sql import functions
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType, StringType, FloatType
from datetime import datetime
import os
feb = self.filter_data(self.SRC_DIR + "tl_feb19.csv", 13)
apr = self.filter_data(self.SRC_DIR + "tl_apr19.csv", 15)
df_all = feb.union(apr)
df_all = df_all.dropDuplicates(subset=["PRIMARY_ID"])
create_emi_amount_udf = udf(create_emi_amount, FloatType())
df_all = df_all.withColumn("EMI_Amount", create_emi_amount_udf('Sanction_Amount', 'Loan_Type'))
df_all.write.csv(self.DST_DIR + "merged_amounts.csv", header=True, mode='overwrite')
The error is very clear: the module 'app' is missing on the executors. Your Python code runs on the driver, but your udf runs on the executor's Python VM. When you call the udf, Spark serializes the create_emi_amount function to send it to the executors. So, somewhere in your create_emi_amount method, you use or import the app module.
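To illustrate the mechanism: the pickled function carries along whatever it references, so a self-contained function that only uses builtins would not drag the app module to the executors. This is just a sketch; the rate logic below is a made-up assumption, not your actual EMI formula:

from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

def create_emi_amount(sanction_amount, loan_type):
    # Self-contained: uses only builtins, so pickling it does not capture
    # any reference to the 'app' module.
    if sanction_amount is None:
        return None
    rate = 0.10 if loan_type == "HOME" else 0.14   # hypothetical rates
    return float(sanction_amount) * rate / 12.0     # hypothetical monthly EMI

create_emi_amount_udf = udf(create_emi_amount, FloatType())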
A solution to your problem is to use the same environment on both the driver and the executors. In spark-env.sh, set the same Python virtualenv in PYSPARK_DRIVER_PYTHON=... and PYSPARK_PYTHON=....
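For example, if both should use the same virtualenv, spark-env.sh could contain something like the following (the /opt/venvs/myproject path is only an illustration; substitute your own environment):

export PYSPARK_PYTHON=/opt/venvs/myproject/bin/python
export PYSPARK_DRIVER_PYTHON=/opt/venvs/myproject/bin/python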