Is there any reference listing which versions of the AWS Java SDK, Hadoop, the hadoop-aws bundle, Hive, and Spark are compatible with each other?
For example, I know Spark is not compatible with Hive versions above 2.1.1.
Apache Hadoop is an open source framework that is used to efficiently store and process large datasets ranging in size from gigabytes to petabytes of data. Instead of using one large computer to store and process the data, Hadoop allows clustering multiple computers to analyze massive datasets in parallel more quickly.
Hadoop's “S3A” client offers high-performance IO against the Amazon S3 object store and compatible implementations. It directly reads and writes S3 objects, is compatible with standard S3 clients, and is compatible with files created by the older s3n:// client and Amazon EMR's s3:// client.
AWS Java SDK :: Bundle: a single bundled dependency that includes all service and dependent JARs, with third-party libraries relocated to different namespaces.
You cannot drop in a later version of the AWS SDK than the one hadoop-aws was built with and expect the s3a connector to work. Ever. That is now written down quite clearly in the S3A troubleshooting docs.
Whatever problem you have, changing the AWS SDK version will not fix things, only change the stack traces you see.
This may seem frustrating, given the rate at which the AWS team push out a new SDK, but you have to understand that (a) the API often changes incompatibly between versions (as you have seen), and (b) every release introduces/moves bugs which end up causing problems.
Here is the 3.x timeline of things which broke on updates of the AWS SDK.
Every upgrade of the AWS SDK JAR causes a problem, somewhere. Sometimes it needs an edit to the code and a recompile; most commonly it is logs filling up with false-alarm messages, dependency problems, threading quirks, etc. These are things which can take time to surface.
What you get with a Hadoop release is not just the aws-sdk JAR it was compiled against; you also get a hadoop-aws JAR containing the workarounds and fixes for whatever problems that SDK release introduced, as identified in the minimum of 4 weeks of testing before the Hadoop release ships.
Which is why, no, you shouldn't be changing JARs unless you plan to do a complete end-to-end retest of the s3a client code, including load tests. You are encouraged to do that: the Hadoop project always welcomes more testing of our pre-release code, and the Hadoop 3.1 binaries are ready to play with. But trying to do it yourself by changing JARs? Sadly, an isolated exercise in pain.
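If you want a quick sanity check before blaming the SDK, something like the sketch below can flag the usual symptoms of swapped JARs: a hadoop-aws JAR whose version differs from hadoop-common, or more than one AWS SDK version on the same classpath. This is only an illustration, not anything shipped with Hadoop: the function name `find_jar_conflicts`, the filename pattern, and the sample filenames are all assumptions, and it only looks at JAR names, not at what is inside them.

```python
import re
from collections import defaultdict

# Assumed naming convention: "<artifact>-<version>.jar", where the version
# starts with a digit, e.g. "hadoop-aws-3.1.0.jar".
JAR_PATTERN = re.compile(r"^(?P<artifact>.+?)-(?P<version>\d[\w.\-]*)\.jar$")


def find_jar_conflicts(jar_names):
    """Return a list of human-readable warnings for a list of JAR filenames.

    Hypothetical helper: it checks only two things, both based on filenames.
    1. hadoop-aws and hadoop-common should carry the same version.
    2. Only one version of aws-java-sdk* artifacts should be present.
    """
    versions = defaultdict(set)
    for name in jar_names:
        match = JAR_PATTERN.match(name)
        if match:
            versions[match.group("artifact")].add(match.group("version"))

    problems = []

    # hadoop-aws is built and tested against one specific Hadoop release.
    aws_module = versions.get("hadoop-aws", set())
    common = versions.get("hadoop-common", set())
    if aws_module and common and aws_module != common:
        problems.append(
            "hadoop-aws %s does not match hadoop-common %s"
            % (sorted(aws_module), sorted(common))
        )

    # Mixed SDK versions usually mean someone dropped in an extra JAR.
    sdk_versions = set()
    for artifact, found in versions.items():
        if artifact.startswith("aws-java-sdk"):
            sdk_versions |= found
    if len(sdk_versions) > 1:
        problems.append(
            "multiple AWS SDK versions on the classpath: %s" % sorted(sdk_versions)
        )

    return problems


if __name__ == "__main__":
    # Illustrative filenames only; list your own lib directory instead.
    sample = [
        "hadoop-common-2.7.3.jar",
        "hadoop-aws-3.1.0.jar",
        "aws-java-sdk-1.7.4.jar",
        "aws-java-sdk-bundle-1.11.271.jar",
    ]
    for warning in find_jar_conflicts(sample):
        print(warning)
```

An empty result does not prove the combination is supported; it only means the filenames look consistent. The reliable fix is still to take the hadoop-aws JAR and the SDK JAR that shipped with your Hadoop release.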