I'm trying to show a PySpark DataFrame, and I'm getting this error:
Py4JJavaError: An error occurred while calling o607.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 114.0 failed 4 times, most recent failure: Lost task 0.3 in stage 114.0 (TID 15904, zw02-data-hdp-dn25211.mt, executor 416): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/data5/hadoop/yarn/nm-local-dir/usercache/hadoop-hmart-peisongpa/appcache/application_1634562540530_1814236/container_e37_1634562540530_1814236_01_001496/pyspark.zip/pyspark/worker.py", line 177, in main
process()
File "/data5/hadoop/yarn/nm-local-dir/usercache/hadoop-hmart-peisongpa/appcache/application_1634562540530_1814236/container_e37_1634562540530_1814236_01_001496/pyspark.zip/pyspark/worker.py", line 172, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/data5/hadoop/yarn/nm-local-dir/usercache/hadoop-hmart-peisongpa/appcache/application_1634562540530_1814236/container_e37_1634562540530_1814236_01_001496/pyspark.zip/pyspark/serializers.py", line 220, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
File "/data5/hadoop/yarn/nm-local-dir/usercache/hadoop-hmart-peisongpa/appcache/application_1634562540530_1814236/container_e37_1634562540530_1814236_01_001496/pyspark.zip/pyspark/serializers.py", line 138, in dump_stream
for obj in iterator:
File "/data5/hadoop/yarn/nm-local-dir/usercache/hadoop-hmart-peisongpa/appcache/application_1634562540530_1814236/container_e37_1634562540530_1814236_01_001496/pyspark.zip/pyspark/serializers.py", line 209, in _batched
for item in iterator:
File "<string>", line 1, in <lambda>
File "/data5/hadoop/yarn/nm-local-dir/usercache/hadoop-hmart-peisongpa/appcache/application_1634562540530_1814236/container_e37_1634562540530_1814236_01_001496/pyspark.zip/pyspark/worker.py", line 71, in <lambda>
return lambda *a: f(*a)
File "<ipython-input-12-2ecb67285c3b>", line 5, in <lambda>
File "<ipython-input-12-2ecb67285c3b>", line 4, in convert_target
TypeError: int() argument must be a string, a bytes-like object or a number, not 'DenseVector'
This is my code, which runs in a Jupyter notebook:

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, MinMaxScaler
from pyspark.sql import functions as f

df2 = spark.sql(sql_text)
# Wrap the numeric column in a vector so MinMaxScaler can consume it
assembler = VectorAssembler(inputCols=["targetstep"], outputCol="x_vec")
scaler = MinMaxScaler(inputCol="x_vec", outputCol="targetstep_scaled")
pipeline = Pipeline(stages=[assembler, scaler])
scalerModel = pipeline.fit(df2)
df2 = scalerModel.transform(df2)
# target_udf is defined earlier in the notebook (not shown)
df2 = df2.withColumn('targetstep', target_udf(f.col('targetstep_scaled'))).drop('x_vec')
df2.show()
I'm sure that the Pipeline and the withColumn() call are fine, but I don't know why the show() method fails.
PySpark DataFrames are lazily evaluated.
When you call .show(), you are asking all of the prior steps to execute, and any one of them may be broken; you just can't see the failure until an action like .show() forces them to run.
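To illustrate with a minimal sketch (the DataFrame and column names here are made up): a transformation containing a broken UDF is accepted without complaint, and the error only surfaces when an action runs it.

from pyspark.sql import SparkSession, functions as f
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (2,)], ['x'])

bad_udf = f.udf(lambda v: int([v]), IntegerType())  # int() on a list: TypeError
df = df.withColumn('y', bad_udf(f.col('x')))        # no error yet: transformations are lazy
df.show()                                           # fails here, when the action executes the plan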
Go back through the earlier steps and call .collect() on the DataFrame after each operation. This will at least let you isolate the step where the bad data was created.
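In your case the traceback already points at the culprit: convert_target (wrapped by target_udf) calls int() on the value it receives, but MinMaxScaler produces a DenseVector, not a number. A minimal sketch of the likely fix, assuming your UDF looks roughly like this (the body is a guess reconstructed from the traceback):

from pyspark.sql import functions as f
from pyspark.sql.types import IntegerType

# targetstep_scaled is a one-element DenseVector, so index into it
# before converting; int() on the vector itself raises the TypeError above.
def convert_target(x):
    return int(x[0])

target_udf = f.udf(convert_target, IntegerType())

Note that int() truncates the 0-to-1 scaled value to 0 for nearly every row; if you want the scaled value itself, return float(x[0]) and declare a FloatType return type instead.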