Is there any reference for which sets of versions are compatible across the AWS Java SDK, Hadoop, the hadoop-aws bundle, Hive, and Spark? For example, I know Spark is not compatible with Hive versions above 2.1.1.


hadoop aws versions compatibility

1 Answer

You cannot drop in a later version of the AWS SDK than the one hadoop-aws was built with and expect the s3a connector to work. Ever. That is now written down quite clearly in the S3A troubleshooting docs.

Whatever problem you have, changing the AWS SDK version will not fix things, only change the stack traces you see.

This may seem frustrating, given the rate at which the AWS team push out a new SDK, but you have to understand that (a) the API often changes incompatibly between versions (as you have seen), and (b) every release introduces/moves bugs which end up causing problems.

Here is the 3.x timeline of things which broke on updates of the AWS SDK.

Move to 1.11.86, and some tests hang under load.
Fix: move to 1.11.134, leading to logs full of AWS telling us off for deliberately calling abort() on a read.
Fix: move to 1.11.199, leading to logs full of stack traces.
Fix: move to 1.11.271, and the shaded JAR pulls in netty unshaded.

Every upgrade of the AWS SDK JAR causes a problem, somewhere. Sometimes it needs an edit to the code and a recompile; most commonly it is logs filling up with false-alarm messages, dependency problems, threading quirks, etc. Things which can take time to surface.

What you get with a hadoop release is not just the aws-sdk JAR it was compiled against; you also get a hadoop-aws JAR which contains the workarounds and fixes for whatever problems that SDK release has introduced and which were identified in the minimum of 4 weeks of testing before the hadoop release ships.

Which is why, no, you shouldn't be changing JARs unless you plan to do a complete end-to-end retest of the s3a client code, including load tests. You are encouraged to do that: the hadoop project always welcomes more testing of our pre-release code, with the Hadoop 3.1 binaries ready to play with. But trying to do it yourself by changing JARs? Sadly, an isolated exercise in pain.
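If you want a guard against somebody quietly swapping the SDK JAR, one option is a small check over the JAR names on your classpath. The following is only a minimal sketch, not anything shipped with Hadoop: the function name `find_sdk_mismatches`, the example JAR list, and the expected version string are all hypothetical. You would have to look up the SDK version your own hadoop-aws release was actually built with (for example in its POM) and pass that in.

```python
import re

# Matches AWS SDK v1 JAR names such as aws-java-sdk-bundle-1.11.271.jar,
# aws-java-sdk-s3-1.10.6.jar or aws-java-sdk-1.7.4.jar.
_SDK_JAR = re.compile(r"^(aws-java-sdk(?:-[a-z0-9]+)*?)-(\d+(?:\.\d+)*)\.jar$")


def find_sdk_mismatches(jar_names, expected_sdk_version):
    """Return (jar_name, found_version) for each AWS SDK JAR whose
    version differs from the one hadoop-aws is expected to be paired with.

    expected_sdk_version is supplied by the caller; this sketch does not
    know which SDK any given Hadoop release was built against.
    """
    mismatches = []
    for name in jar_names:
        match = _SDK_JAR.match(name)
        if match and match.group(2) != expected_sdk_version:
            mismatches.append((name, match.group(2)))
    return mismatches


if __name__ == "__main__":
    # Hypothetical classpath listing and hypothetical expected version.
    jars = ["hadoop-aws-3.1.0.jar", "aws-java-sdk-bundle-1.11.375.jar"]
    for jar, version in find_sdk_mismatches(jars, "1.11.271"):
        print(f"{jar}: found SDK {version}, expected 1.11.271")
```

A check like this only tells you that the pairing has drifted. It does not make a mismatched pair safe, and it is no substitute for the end-to-end retest described above.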

111

answered Sep 30 '22 23:09

stevel


Tags:

amazon-s3

apache-spark

hadoop

hive

tooptoop4
