
Spark 2.3 dynamic partitionBy not working on S3 AWS EMR 5.13.0

Dynamic partition overwrite, introduced in Spark 2.3, doesn't seem to work on AWS EMR 5.13.0 when writing to S3.

When executing, a temporary directory is created in S3, but it disappears once the process completes, without the new data being written to the final folder structure.

The issue was found when executing a Scala/Spark 2.3 application on EMR 5.13.0.

The configuration is as follows:

val spark = SparkSession
  .builder
  .appName(MyClass.getClass.getSimpleName)
  .getOrCreate()

spark.conf.set("spark.sql.sources.partitionOverwriteMode","DYNAMIC") // also tried "dynamic"
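For what it's worth, the same setting can also be supplied when the session is built rather than mutated afterwards. A sketch, with `"MyApp"` as a placeholder application name; the valid values are "static" (the default) and "dynamic", and Spark normalizes the case, which is why both spellings behave the same:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: supply the conf at session-build time instead of via spark.conf.set.
// "MyApp" is a placeholder; valid values are "static" and "dynamic"
// (case-insensitive).
val spark = SparkSession
  .builder
  .appName("MyApp")
  .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
  .getOrCreate()
```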

The code that writes to S3:

val myDataset : Dataset[MyType] = ...

val w = myDataset
    .coalesce(10)
    .write
    .option("encoding", "UTF-8")
    .option("compression", "snappy")
    .mode("overwrite")
    .partitionBy("col_1","col_2")

w.parquet(s"$destinationPath/${Constants.MyTypeTableName}")

where destinationPath is an S3 bucket/folder path.
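For context, a sketch of what dynamic mode is expected to do, using hypothetical data: only the partitions present in the incoming DataFrame should be replaced, while existing partitions under the target path are left alone. Static mode, by contrast, deletes the whole target directory first.

```scala
import spark.implicits._

// Hypothetical data. With partitionOverwriteMode=dynamic, this overwrite
// should replace only the col_1=A/col_2=1 and col_1=A/col_2=2 partitions;
// any existing col_1=B/... partitions under the path should survive.
Seq(("A", 1, "new"), ("A", 2, "new"))
  .toDF("col_1", "col_2", "value")
  .write
  .mode("overwrite")
  .partitionBy("col_1", "col_2")
  .parquet(s"$destinationPath/example")
```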

Has anyone else experienced this issue?

asked May 10 '18 by David Costa Faidella



1 Answer

Upgrading to EMR 5.19 fixes the problem. However, my previous answer is incorrect: using the EMRFS S3-optimized committer has nothing to do with it. The EMRFS S3-optimized committer is silently skipped when spark.sql.sources.partitionOverwriteMode is set to dynamic: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-committer-reqs.html

If you can upgrade to at least EMR 5.19.0, AWS's EMRFS S3-optimized Committer solves these issues.

--conf spark.sql.parquet.fs.optimized.committer.optimization-enabled=true

See: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-s3-optimized-committer.html
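A sketch of setting that flag at session-build time ("MyApp" is a placeholder name). Note the requirement linked above: the optimized committer is skipped when partitionOverwriteMode is dynamic, so the two settings don't combine.

```scala
import org.apache.spark.sql.SparkSession

// Sketch (EMR 5.19.0+): the committer flag from the answer, set at
// session-build time. Note: the optimized committer is silently skipped
// when spark.sql.sources.partitionOverwriteMode is "dynamic".
val spark = SparkSession
  .builder
  .appName("MyApp") // placeholder
  .config("spark.sql.parquet.fs.optimized.committer.optimization-enabled", "true")
  .getOrCreate()
```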

answered Oct 20 '22 by Robert Bart