
Spark job reading from S3 on Spark cluster gives IllegalAccessError: tried to access method MutableCounterLong [duplicate]

I have a Spark cluster on DC/OS and I am running a Spark job that reads from S3. The versions are the following:

  • Spark 2.3.1
  • Hadoop 2.7
  • The dependency for the AWS connection: "org.apache.hadoop" % "hadoop-aws" % "3.0.0-alpha2"

I read in the data by doing the following:

```scala
val hadoopConf = sparkSession.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3a.endpoint", Config.awsEndpoint)
hadoopConf.set("fs.s3a.access.key", Config.awsAccessKey)
hadoopConf.set("fs.s3a.secret.key", Config.awsSecretKey)
hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

val data = sparkSession.read.parquet("s3a://" + "path/to/file")
```

The error I am getting is:

```
Exception in thread "main" java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation
    at org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:194)
    at org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:215)
    at org.apache.hadoop.fs.s3a.S3AInstrumentation.<init>(S3AInstrumentation.java:138)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:170)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
    at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:44)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:321)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
    at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:559)
    at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:543)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:809)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:182)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:207)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
```

This job only fails if I submit it as a JAR to the cluster. If I run the same code locally or in a Docker container, it reads the data without any problem.

I would be very grateful if anyone could help me with this!

asked Jul 25 '18 by Giselle Van Dongen

1 Answer

This is one of the stack traces you get when you mix incompatible hadoop-* JARs. Spark 2.3.1 built for Hadoop 2.7 puts the Hadoop 2.7 classes on the cluster classpath, but your hadoop-aws 3.0.0-alpha2 JAR was compiled against Hadoop 3.x internals: its S3AInstrumentation tries to call a MutableCounterLong constructor that is not accessible in the 2.7 classes, which is exactly the IllegalAccessError in your trace. The fix is to use the hadoop-aws artifact from the same release line as the rest of your Hadoop JARs.
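A quick way to confirm which Hadoop version the cluster classpath actually provides is Hadoop's public `VersionInfo` API, runnable from a `spark-shell` on the cluster (a minimal sketch):

```scala
// Prints the version of the Hadoop classes actually on the classpath;
// the hadoop-aws JAR must come from the same release line.
println(org.apache.hadoop.util.VersionInfo.getVersion)
```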

As the S3A docs say:

> Critical: Do not attempt to "drop in" a newer version of the AWS SDK than that which the Hadoop version was built with. Whatever problem you have, changing the AWS SDK version will not fix things, only change the stack traces you see.
>
> Randomly changing hadoop- and aws- JARs in the hope of making a problem "go away" or to gain access to a feature you want, will not lead to the outcome you desire.
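In practice that means pinning hadoop-aws to the same 2.7.x line as the Hadoop JARs your Spark 2.3.1 build ships with, instead of a 3.x artifact. A minimal sbt sketch (2.7.7 is an assumed patch version; match it to your cluster's hadoop-common):

```scala
// build.sbt -- keep hadoop-aws on the same release line as the Hadoop
// classes already on the Spark classpath (2.7.x here). The 2.7.7 patch
// version is an assumption: use whatever your cluster actually runs.
libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "2.7.7"
// hadoop-aws 2.7.x transitively pulls in the aws-java-sdk build it was
// compiled and tested against; do not override it with a newer SDK.
```

The same rule applies if you fetch the connector at submit time with `spark-submit --packages`: use the coordinates from your Hadoop release line, not a newer one.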

answered Nov 15 '22 by stevel