The problem is quite simple: you have a local Spark instance (either a cluster or just local mode) and you want to read from gs://.
In my case, on Spark 2.4.3, I needed to do the following to enable GCS access from Spark local. I used a JSON keyfile instead of the client.id/secret proposed in the other answer.
In $SPARK_HOME/jars/, use the shaded gcs-connector jar from here: http://repo2.maven.org/maven2/com/google/cloud/bigdataoss/gcs-connector/hadoop2-1.9.17/ (otherwise I had various failures with transitive dependencies).
(Optional) To my build.sbt, add:
libraryDependencies += ("com.google.cloud.bigdataoss" % "gcs-connector" % "hadoop2-1.9.17"
  exclude("javax.jms", "jms")
  exclude("com.sun.jdmk", "jmxtools")
  exclude("com.sun.jmx", "jmxri"))
In $SPARK_HOME/conf/spark-defaults.conf, add:
spark.hadoop.google.cloud.auth.service.account.enable true
spark.hadoop.google.cloud.auth.service.account.json.keyfile /path/to/my/keyfile
And everything is working.
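As a hedged illustration, the same two properties can also be set per session instead of in spark-defaults.conf. The sketch below is not part of the setup above: it assumes pyspark is installed and the shaded connector jar is already in $SPARK_HOME/jars/. The helper names gcs_spark_conf and build_session, and the bucket path in the usage comment, are hypothetical.

```python
def gcs_spark_conf(keyfile_path):
    """Return the two Spark properties from spark-defaults.conf as a dict."""
    return {
        "spark.hadoop.google.cloud.auth.service.account.enable": "true",
        "spark.hadoop.google.cloud.auth.service.account.json.keyfile": keyfile_path,
    }


def build_session(keyfile_path, app_name="gcs-local"):
    """Create a local SparkSession with the GCS keyfile properties applied."""
    # Imported lazily so gcs_spark_conf can be used without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.master("local[*]").appName(app_name)
    for key, value in gcs_spark_conf(keyfile_path).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Usage (hypothetical bucket and object path):
# spark = build_session("/path/to/my/keyfile")
# spark.read.text("gs://my-bucket/some/file.txt").show(5)
```

Properties prefixed with spark.hadoop. are copied into the Hadoop configuration when the Spark context starts, so they need to be set before the session is created.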
I am submitting here the solution I have come up with by combining different resources:
Download the Google Cloud Storage connector (gcs-connector) and store it in the $SPARK/jars/ folder (see Alternative 1 at the bottom).
Download the core-site.xml file from here, or copy it from below. This is a configuration file used by Hadoop (which is used by Spark).
Store the core-site.xml file in a folder. Personally, I create the $SPARK/conf/hadoop/conf/ folder and store it there.
In the spark-env.sh file, indicate the Hadoop conf folder by adding the following line: export HADOOP_CONF_DIR=</absolute/path/to/hadoop/conf/>
Create an OAuth2 key from the respective page of Google (Google Console -> API Manager -> Credentials).
Copy the credentials to the core-site.xml file.
Alternative 1: Instead of copying the file to the $SPARK/jars folder, you can store the jar in any folder and add that folder to the Spark classpath. One way is to edit SPARK_CLASSPATH in the spark-env.sh file, but SPARK_CLASSPATH is now deprecated. Therefore one can look here on how to add a jar to the Spark classpath.
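The file placement described in the steps above can be sanity-checked before starting Spark. This is only a rough sketch under assumptions: it assumes the connector jar's file name starts with "gcs-connector", and the helper names find_connector_jars and has_core_site are hypothetical.

```python
import glob
import os


def find_connector_jars(jars_dir):
    """Return sorted paths of jars in jars_dir whose name starts with 'gcs-connector'."""
    pattern = os.path.join(jars_dir, "gcs-connector*.jar")
    return sorted(glob.glob(pattern))


def has_core_site(hadoop_conf_dir):
    """Return True if hadoop_conf_dir contains a core-site.xml file."""
    return os.path.isfile(os.path.join(hadoop_conf_dir, "core-site.xml"))


# Usage, assuming the SPARK_HOME and HADOOP_CONF_DIR environment variables are set:
# print(find_connector_jars(os.path.join(os.environ["SPARK_HOME"], "jars")))
# print(has_core_site(os.environ["HADOOP_CONF_DIR"]))
```

An empty list from the first call, or False from the second, suggests that one of the files is not where Spark will look for it.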
The core-site.xml file:
<configuration>
  <property>
    <name>fs.gs.impl</name>
    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
    <description>Register GCS Hadoop filesystem</description>
  </property>
  <property>
    <name>fs.gs.auth.service.account.enable</name>
    <value>false</value>
    <description>Force OAuth2 flow</description>
  </property>
  <property>
    <name>fs.gs.auth.client.id</name>
    <value>32555940559.apps.googleusercontent.com</value>
    <description>Client id of Google-managed project associated with the Cloud SDK</description>
  </property>
  <property>
    <name>fs.gs.auth.client.secret</name>
    <value>fslkfjlsdfj098ejkjhsdf</value>
    <description>Client secret of Google-managed project associated with the Cloud SDK</description>
  </property>
  <property>
    <name>fs.gs.project.id</name>
    <value>_THIS_VALUE_DOES_NOT_MATTER_</value>
    <description>This value is required by GCS connector, but not used in the tools provided here.
      The value provided is actually an invalid project id (starts with `_`).
    </description>
  </property>
</configuration>
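After copying the credentials in, it can help to confirm that the file still parses and that none of the properties were dropped. The following is a small sketch using only the Python standard library. The list of expected names simply mirrors the properties in the core-site.xml above, and the function names read_properties and missing_properties are hypothetical.

```python
import xml.etree.ElementTree as ET

# Property names taken from the core-site.xml shown above.
EXPECTED_PROPERTIES = (
    "fs.gs.impl",
    "fs.gs.auth.service.account.enable",
    "fs.gs.auth.client.id",
    "fs.gs.auth.client.secret",
    "fs.gs.project.id",
)


def read_properties(xml_text):
    """Parse Hadoop-style configuration XML into a {name: value} dict."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def missing_properties(xml_text, expected=EXPECTED_PROPERTIES):
    """Return the expected property names that are absent or have an empty value."""
    props = read_properties(xml_text)
    return [name for name in expected if not props.get(name)]


# Usage (path is a placeholder):
# with open("/absolute/path/to/hadoop/conf/core-site.xml") as f:
#     print(missing_properties(f.read()))
```

An empty list means every property from the file above is present with a non-empty value. It does not verify that the credentials themselves are valid.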