 

Parquet error when saving from Spark

After repartitioning a DataFrame in Spark 1.3.0 I get a Parquet exception when saving it to Amazon S3.

logsForDate
    .repartition(10)
    .saveAsParquetFile(destination) // <-- Exception here

The exception I receive is:

java.io.IOException: The file being written is in an invalid state. Probably caused by an error thrown previously. Current state: COLUMN
at parquet.hadoop.ParquetFileWriter$STATE.error(ParquetFileWriter.java:137)
at parquet.hadoop.ParquetFileWriter$STATE.startBlock(ParquetFileWriter.java:129)
at parquet.hadoop.ParquetFileWriter.startBlock(ParquetFileWriter.java:173)
at parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:152)
at parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:112)
at parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:73)
at org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$writeShard$1(newParquet.scala:635)
at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:649)
at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:649)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

I would like to know what the problem is and how to solve it.

Asked Apr 30 '15 by Interfector

People also ask

Does Spark support Parquet?

Parquet is a columnar format that is supported by many other data processing systems. Spark SQL provides support for both reading and writing Parquet files that automatically preserves the schema of the original data.
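For example, here is a minimal sketch of that round trip with the Spark 1.3-era API (the paths and the "status" column are placeholders, not taken from the question):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("parquet-example"))
val sqlContext = new SQLContext(sc)

// Reading Parquet yields a DataFrame whose schema comes from the file metadata.
val df = sqlContext.parquetFile("hdfs:///data/logs.parquet")
df.printSchema()

// Writing it back out preserves that schema automatically.
df.filter(df("status") === 200)
  .saveAsParquetFile("hdfs:///data/logs-filtered.parquet")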

What is the default compression for Parquet?

GZIP is the default write compression format for files in the Parquet and text file storage formats. Files in the tar.gz format are not supported. LZ4 – this member of the Lempel-Ziv 77 (LZ77) family also focuses on compression and decompression speed rather than maximum compression of data.
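If the default does not suit you, Spark lets you pick the Parquet codec explicitly. A hedged sketch, reusing logsForDate and destination from the question (the exact default codec varies by Spark version, so check your release):

// Supported values in Spark 1.x include "uncompressed", "snappy", "gzip" and "lzo".
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
logsForDate.saveAsParquetFile(destination)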


2 Answers

I can actually reproduce this problem with Spark 1.3.1 on EMR, when saving to S3.

However, saving to HDFS works fine. You could save to HDFS first, and then use e.g. s3distcp to move the files to S3.
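A rough sketch of that workaround (the HDFS path and S3 bucket are placeholders; s3-dist-cp is the EMR-bundled variant of s3distcp):

// Step 1: write the Parquet output to HDFS first, where the write completes reliably.
logsForDate
  .repartition(10)
  .saveAsParquetFile("hdfs:///tmp/logsForDate.parquet")

// Step 2: once the job succeeds, copy the files to S3 from the shell, e.g.:
//   s3-dist-cp --src hdfs:///tmp/logsForDate.parquet --dest s3://my-bucket/logsForDate.parquet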

Answered by Eric Eijkelenboom


I faced this error when calling saveAsParquetFile to HDFS. It was caused by a datanode socket write timeout, so I changed it to a longer value in the Hadoop settings:

<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>3000000</value>
</property>
<property>
  <name>dfs.socket.timeout</name>
  <value>3000000</value>
</property> 

Hope this helps; perhaps you can configure a similar timeout on the S3 side.
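If editing the Hadoop XML configuration is not convenient, the same values can also be set programmatically on the job's Hadoop configuration. A small sketch mirroring the XML above (values in milliseconds):

// Apply the longer timeouts on the SparkContext's Hadoop configuration before writing.
sc.hadoopConfiguration.set("dfs.datanode.socket.write.timeout", "3000000")
sc.hadoopConfiguration.set("dfs.socket.timeout", "3000000")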

Answered by yjshen