I'm trying to show a PySpark DataFrame, and I'm getting this error:
Py4JJavaError: An error occurred while calling o607.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 114.0 failed 4 times, most recent failure: Lost task 0.3 in stage 114.0 (TID 15904, zw02-data-hdp-dn25211.mt, executor 416): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/data5/hadoop/yarn/nm-local-dir/usercache/hadoop-hmart-peisongpa/appcache/application_1634562540530_1814236/container_e37_1634562540530_1814236_01_001496/pyspark.zip/pyspark/worker.py", line 177, in main
process()
File "/data5/hadoop/yarn/nm-local-dir/usercache/hadoop-hmart-peisongpa/appcache/application_1634562540530_1814236/container_e37_1634562540530_1814236_01_001496/pyspark.zip/pyspark/worker.py", line 172, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/data5/hadoop/yarn/nm-local-dir/usercache/hadoop-hmart-peisongpa/appcache/application_1634562540530_1814236/container_e37_1634562540530_1814236_01_001496/pyspark.zip/pyspark/serializers.py", line 220, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
File "/data5/hadoop/yarn/nm-local-dir/usercache/hadoop-hmart-peisongpa/appcache/application_1634562540530_1814236/container_e37_1634562540530_1814236_01_001496/pyspark.zip/pyspark/serializers.py", line 138, in dump_stream
for obj in iterator:
File "/data5/hadoop/yarn/nm-local-dir/usercache/hadoop-hmart-peisongpa/appcache/application_1634562540530_1814236/container_e37_1634562540530_1814236_01_001496/pyspark.zip/pyspark/serializers.py", line 209, in _batched
for item in iterator:
File "<string>", line 1, in <lambda>
File "/data5/hadoop/yarn/nm-local-dir/usercache/hadoop-hmart-peisongpa/appcache/application_1634562540530_1814236/container_e37_1634562540530_1814236_01_001496/pyspark.zip/pyspark/worker.py", line 71, in <lambda>
return lambda *a: f(*a)
File "<ipython-input-12-2ecb67285c3b>", line 5, in <lambda>
File "<ipython-input-12-2ecb67285c3b>", line 4, in convert_target
TypeError: int() argument must be a string, a bytes-like object or a number, not 'DenseVector'
This is my code, which runs in a Jupyter notebook:

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, MinMaxScaler
from pyspark.sql import functions as f

df2 = spark.sql(sql_text)
# Wrap the numeric column in a vector so MinMaxScaler can consume it
assembler = VectorAssembler(inputCols=["targetstep"], outputCol="x_vec")
scaler = MinMaxScaler(inputCol="x_vec", outputCol="targetstep_scaled")
pipeline = Pipeline(stages=[assembler, scaler])
scalerModel = pipeline.fit(df2)
df2 = scalerModel.transform(df2)
# target_udf is defined earlier in the notebook (not shown)
df2 = df2.withColumn('targetstep', target_udf(f.col('targetstep_scaled'))).drop('x_vec')
df2.show()
I'm sure that the Pipeline and the withColumn() call are fine, but I don't know why the show() method fails.
PySpark DataFrames are lazily evaluated.
When you call .show(), you are asking all of the prior steps to execute, and any one of them may be broken; you just can't see the failure until an action like .show() forces them to run.
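To illustrate with a minimal sketch (the DataFrame and column names here are made up): a transformation containing a broken UDF is accepted without complaint, and the error only surfaces when an action runs it.

from pyspark.sql import SparkSession, functions as f
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (2,)], ['x'])

bad_udf = f.udf(lambda v: int([v]), IntegerType())  # int() on a list: TypeError
df = df.withColumn('y', bad_udf(f.col('x')))        # no error yet: transformations are lazy
df.show()                                           # fails here, when the action executes the plan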
Go back through the earlier steps and call .collect() on the DataFrame after each operation. This will at least let you isolate the step where the bad data was created.
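In your case the traceback already points at the culprit: convert_target (wrapped by target_udf) calls int() on the value it receives, but MinMaxScaler produces a DenseVector, not a number. A minimal sketch of the likely fix, assuming your UDF looks roughly like this (the body is a guess reconstructed from the traceback):

from pyspark.sql import functions as f
from pyspark.sql.types import IntegerType

# targetstep_scaled is a one-element DenseVector, so index into it
# before converting; int() on the vector itself raises the TypeError above.
def convert_target(x):
    return int(x[0])

target_udf = f.udf(convert_target, IntegerType())

Note that int() truncates the 0-to-1 scaled value to 0 for nearly every row; if you want the scaled value itself, return float(x[0]) and declare a FloatType return type instead.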