I am trying to list all objects in a bucket, and then read some or all of them as CSV. I have spent two days now, trying to do both, but I can only get one working at a time if I'm using Google's libraries.
I think the problem lies in an incompatibility between Google's own libraries, but I'm not entirely sure. First, I think I should show how I'm doing each thing.
This is how I'm reading a single file. In my version of Scala, you can use a gs://
url with spark.read.csv
:
val jsonKeyFile = "my-local-keyfile.json"
ss.sparkContext.hadoopConfiguration.set("google.cloud.auth.service.account.json.keyfile", jsonKeyFile)
spark.read
.option("header", "true")
.option("sep", ",")
.option("inferSchema", "false")
.option("mode", "FAILFAST")
.csv(gcsFile)
This actually works alone, and I get a working DF from it. Then the problem arises when I try to add Google's Storage library:
libraryDependencies += "com.google.cloud" % "google-cloud-storage" % "1.70.0"
If I try to run the same code again, I get this bad boy from the .csv call:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
19/05/14 16:38:00 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
An exception or error caused a run to abort: Class com.google.common.base.Suppliers$SupplierOfInstance does not implement the requested interface java.util.function.Supplier
java.lang.IncompatibleClassChangeError: Class com.google.common.base.Suppliers$SupplierOfInstance does not implement the requested interface java.util.function.Supplier
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.getGcsFs(GoogleHadoopFileSystemBase.java:1488)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.configure(GoogleHadoopFileSystemBase.java:1659)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.initialize(GoogleHadoopFileSystemBase.java:683)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.initialize(GoogleHadoopFileSystemBase.java:646)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3303)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
...(lots more trace, probably irrelevant)
Then, you might ask, why don't you just not use the library? Well... This is the code that lists the objects in a bucket:
StorageOptions
.newBuilder()
.setCredentials(ServiceAccountCredentials.fromStream(
File(jsonKeyFile).inputStream()))
.build()
.getService
.list(bucket)
.getValues
.asScala
.map(irrelevant)
.toSeq
.toDF("irrelevant")
And I have not yet found a way to do this easily without the specified library.
I found out what caused the problem. Guava:27.1-android was a dependency of some library at some point, I don't know which and how it got there, but it was in use. In this version of Guava, the Supplier interface does not extend the Java Supplier interface.
I fixed it by adding Guava 27.1-jre to my dependencies. I don't know if the order matters, but I don't dare touch anything at this point. Here is where I placed it:
libraryDependencies += "org.scalatest" %% "scalatest" % "3.0.5" % "test"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.1" % "provided"
libraryDependencies += "com.google.guava" % "guava" % "27.1-jre"
libraryDependencies += "com.google.cloud" % "google-cloud-storage" % "1.70.0"
//BQ samples as of 27feb2019 use hadoop2 but hadoop3 seems to work fine and are recommended elsewhere
libraryDependencies += "com.google.cloud.bigdataoss" % "bigquery-connector" % "hadoop3-0.13.16" % "provided"
libraryDependencies += "com.google.cloud.bigdataoss" % "gcs-connector" % "hadoop3-1.9.16" % "provided"
Hope this prevents some other poor soul from spending 2 days on this bs.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With