
How to use the new Hadoop Parquet magic committer with a custom S3 server from Spark

I have Spark 2.4.0 and Hadoop 3.1.1. According to the Hadoop documentation, to use the new magic committer, which allows writing Parquet files to S3 consistently, I've set these values in conf/spark-defaults.conf:

spark.sql.sources.commitProtocolClass       com.hortonworks.spark.cloud.commit.PathOutputCommitProtocol
spark.sql.parquet.output.committer.class    org.apache.hadoop.mapreduce.lib.output.BindingPathOutputCommitter
spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a    org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory
spark.hadoop.fs.s3a.committer.name          magic
spark.hadoop.fs.s3a.committer.magic.enabled true

When using this configuration I end up with the exception:

java.lang.ClassNotFoundException: com.hortonworks.spark.cloud.commit.PathOutputCommitProtocol

My question is twofold: first, do I understand correctly that Hadoop 3.1.1 allows writing Parquet files to S3 consistently?
Second, if I did understand correctly, how do I use the new committer properly from Spark?
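
For reference, the same properties can also be set programmatically when building the SparkSession. The sketch below simply reuses the keys from spark-defaults.conf above and assumes the jar providing com.hortonworks.spark.cloud.commit.PathOutputCommitProtocol is actually on the classpath, which is exactly what the ClassNotFoundException suggests is missing here:

// Sketch only: reuses the keys from spark-defaults.conf above; the committer
// class must be provided by an extra jar (the Hortonworks cloud integration module).
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("magic-committer-sketch")
  .config("spark.sql.sources.commitProtocolClass",
    "com.hortonworks.spark.cloud.commit.PathOutputCommitProtocol")
  .config("spark.sql.parquet.output.committer.class",
    "org.apache.hadoop.mapreduce.lib.output.BindingPathOutputCommitter")
  .config("spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a",
    "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory")
  .config("spark.hadoop.fs.s3a.committer.name", "magic")
  .config("spark.hadoop.fs.s3a.committer.magic.enabled", "true")
  .getOrCreate()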

Kiwy asked Nov 20 '18

1 Answer

Edit:
OK, I have two server instances, one of them being a bit old now. I've attempted to use the latest version of MinIO with these parameters:

sc.hadoopConfiguration.set("hadoop.fs.s3a.path.style.access","true")
sc.hadoopConfiguration.set("hadoop.fs.s3a.fast.upload","true")
sc.hadoopConfiguration.set("hadoop.fs.s3a.fast.upload.buffer","bytebuffer")
sc.hadoopConfiguration.set("fs.s3a.path.style.access","true")
sc.hadoopConfiguration.set("fs.s3a.multipart.size","128M")
sc.hadoopConfiguration.set("fs.s3a.fast.upload.active.blocks","4")
sc.hadoopConfiguration.set("fs.s3a.committer.name","partitioned")

I'm able to write without trouble so far.
However, my Swift server, which is a bit older, with this config:

sc.hadoopConfiguration.set("fs.s3a.signing-algorithm","S3SignerType")

does not seem to support the partitioned committer properly.

Regarding "Hadoop S3guard":
It is not possible currently, Hadoop S3guard that keep metadata of the S3 files must be enable in Hadoop. The S3guard though rely on DynamoDB a proprietary Amazon service.
There's no alternative now like a sqlite file or other DB system to store the metadata.
So if you're using S3 with minio or any other S3 implementation, you're missing DynamoDB.
This article explains nicely how works S3guard
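
For context, enabling S3Guard (against real AWS) essentially means pointing the S3A metadata store at DynamoDB. This is only a sketch with a placeholder table name and region; it will not work against MinIO, which is the whole point above:

// S3Guard sketch: only meaningful against AWS, since the metadata store is DynamoDB.
sc.hadoopConfiguration.set("fs.s3a.metadatastore.impl",
  "org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore")
sc.hadoopConfiguration.set("fs.s3a.s3guard.ddb.table","my-s3guard-table")  // placeholder
sc.hadoopConfiguration.set("fs.s3a.s3guard.ddb.region","eu-west-1")        // placeholder
sc.hadoopConfiguration.set("fs.s3a.s3guard.ddb.table.create","true")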

Kiwy answered Sep 29 '22