java.lang.UnsupportedOperationException: Error in spark when writing

When I try to write the dataset to Parquet files, I get the error below:

18/11/05 06:25:43 ERROR FileFormatWriter: Aborting job null.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 84 in stage 1.0 failed 4 times, most recent failure: Lost task 84.3 in stage 1.0 (TID 989, ip-10-253-194-207.nonprd.aws.csp.net, executor 4): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary
        at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:48)
        at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getInt(OnHeapColumnVector.java:233)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
        at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
        at org.apache.spark.scheduler.Task.run(Task.scala:99)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

But when I call dataset.show() I am able to view the data. I am not sure where to look for the root cause.

asked Oct 17 '22 by John Humanyun

1 Answer

There is an easier way to detect schema differences between Parquet files: use the mergeSchema option, which will report the inconsistent fields in the log.

example code:

spark.read.option("mergeSchema", "true").parquet(fileList: _*)

example log:

Caused by: org.apache.spark.SparkException: Failed to merge fields 'field1' and 'field1'. Failed to merge incompatible data types DoubleType and LongType
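Once mergeSchema has flagged the mismatched field, one way to resolve it is to cast the conflicting column to a single type and rewrite the offending files, so all Parquet files agree on the schema. A minimal sketch, assuming the conflicting column is field1 as in the log above; the paths and the target type (double) are placeholder assumptions, and spark is the existing SparkSession:

import org.apache.spark.sql.functions.col

// Read only the files that carry the wrong type (placeholder path).
val mismatched = spark.read.parquet("s3://bucket/data/old/")

// Cast field1 to the type the rest of the data uses
// (here DoubleType, matching the merge error above) and rewrite.
mismatched
  .withColumn("field1", col("field1").cast("double"))
  .write
  .mode("overwrite")
  .parquet("s3://bucket/data/old-fixed/")

After the rewrite, reading the full file list (with or without mergeSchema) should no longer hit the type conflict.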
answered Oct 21 '22 by murat yildirim