How to use the new Hadoop Parquet magic committer with a custom S3 server from Spark

Tags: amazon-s3, apache-spark, hadoop

I have Spark 2.4.0 and Hadoop 3.1.1. According to the Hadoop documentation, the new magic committer allows Parquet files to be written to S3 consistently. To use it, I've set these values in conf/spark-defaults.conf:

spark.sql.sources.commitProtocolClass       com.hortonworks.spark.cloud.commit.PathOutputCommitProtocol
spark.sql.parquet.output.committer.class    org.apache.hadoop.mapreduce.lib.output.BindingPathOutputCommitter
spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a    org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory
spark.hadoop.fs.s3a.committer.name          magic
spark.hadoop.fs.s3a.committer.magic.enabled true

When using this configuration I end up with the exception:

java.lang.ClassNotFoundException: com.hortonworks.spark.cloud.commit.PathOutputCommitProtocol
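
A ClassNotFoundException like this one means the named class cannot be loaded from the classpath. As a minimal sketch, the check below can be pasted into spark-shell to see whether the class is visible to the driver. The helper name isOnClasspath is made up for this illustration, and the idea that the class ships in a separate Hortonworks cloud-integration JAR rather than in stock Apache Spark 2.4.0 is an assumption based on its package name.

```scala
import scala.util.Try

// Hypothetical helper: true when the current class loader can load the class.
def isOnClasspath(className: String): Boolean =
  Try(Class.forName(className)).isSuccess

// The class named in the exception above. Assumption: it comes from a
// separate Hortonworks JAR that must be added to the Spark classpath.
val committerProtocol = "com.hortonworks.spark.cloud.commit.PathOutputCommitProtocol"

println(s"$committerProtocol available: ${isOnClasspath(committerProtocol)}")
```

If the check prints false, no committer setting will take effect until a JAR providing that class (or an equivalent commit protocol class) is on the classpath.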

My question is twofold. First, do I understand correctly that Hadoop 3.1.1 allows Parquet files to be written to S3 consistently?
Second, if I understood correctly, how do I use the new committer properly from Spark?


asked Nov 20 '18 08:11

Kiwy

1 Answer

Edit:
OK, I have two server instances, one of them a bit old now. I tried the latest version of MinIO with these parameters:

// Keys set directly on the Hadoop configuration start with "fs.s3a.";
// the "spark.hadoop." prefix is only used in Spark properties.
sc.hadoopConfiguration.set("fs.s3a.path.style.access","true")
sc.hadoopConfiguration.set("fs.s3a.fast.upload","true")
sc.hadoopConfiguration.set("fs.s3a.fast.upload.buffer","bytebuffer")
sc.hadoopConfiguration.set("fs.s3a.multipart.size","128M")
sc.hadoopConfiguration.set("fs.s3a.fast.upload.active.blocks","4")
sc.hadoopConfiguration.set("fs.s3a.committer.name","partitioned")

I'm able to write so far without trouble.
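
The options above are Hadoop S3A settings, which S3A reads under names starting with fs.s3a. when they are set on sc.hadoopConfiguration. The same options can go into conf/spark-defaults.conf with a spark.hadoop. prefix, as in the question. A rough sketch of that mapping follows; the helper name toSparkDefaults is made up for this illustration.

```scala
// S3A options from the answer, keyed by their Hadoop names (no prefix).
val s3aSettings: Seq[(String, String)] = Seq(
  "fs.s3a.path.style.access"         -> "true",
  "fs.s3a.fast.upload"               -> "true",
  "fs.s3a.fast.upload.buffer"        -> "bytebuffer",
  "fs.s3a.multipart.size"            -> "128M",
  "fs.s3a.fast.upload.active.blocks" -> "4",
  "fs.s3a.committer.name"            -> "partitioned"
)

// Hypothetical helper: renders each option as a spark-defaults.conf line.
// Spark copies every "spark.hadoop.*" property into the Hadoop
// configuration with that prefix stripped.
def toSparkDefaults(settings: Seq[(String, String)]): Seq[String] =
  settings.map { case (key, value) => s"spark.hadoop.$key $value" }

toSparkDefaults(s3aSettings).foreach(println)

// In spark-shell the same options can be applied at runtime instead:
// s3aSettings.foreach { case (k, v) => sc.hadoopConfiguration.set(k, v) }
```

Keeping the options in one list makes it harder to mix the two naming schemes by accident.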
However, my Swift server, which is a bit older and uses this additional config:

sc.hadoopConfiguration.set("fs.s3a.signing-algorithm","S3SignerType")

does not seem to support the partitioned committer properly.

Regarding Hadoop S3Guard:
Using the magic committer with a custom S3 server is not currently possible. S3Guard, which keeps metadata about the S3 files, must be enabled in Hadoop, and S3Guard relies on DynamoDB, a proprietary Amazon service.
There is currently no alternative backend, such as a SQLite file or another database system, to store the metadata.
So if you're using MinIO or any other S3 implementation, you're missing DynamoDB.
This article explains nicely how S3Guard works.
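
To make the DynamoDB dependency concrete, here is roughly what enabling S3Guard looks like in Hadoop 3.1. This is a sketch only: the option names are the ones I know from the S3A documentation, while the table name and region are hypothetical placeholders.

```scala
// S3Guard options as named in the Hadoop 3.1 S3A documentation.
// The table name and region below are hypothetical placeholders.
val s3guardSettings: Seq[(String, String)] = Seq(
  "fs.s3a.metadatastore.impl" -> "org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore",
  "fs.s3a.s3guard.ddb.table"  -> "my-s3guard-table",
  "fs.s3a.s3guard.ddb.region" -> "eu-west-1"
)

// Print them as spark-defaults.conf lines.
s3guardSettings.foreach { case (key, value) =>
  println(s"spark.hadoop.$key $value")
}
```

The metadata store class talks to a DynamoDB table, which is why a MinIO or Swift deployment without access to AWS cannot satisfy it.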


answered Sep 29 '22 06:09

Kiwy
