I am running a Spark job (version 1.2.0), and the input is a folder inside a Google Clous Storage bucket (i.e. gs://mybucket/folder)
When running the job locally on my Mac machine, I am getting the following error:
5932 [main] ERROR com.doit.customer.dataconverter.Phase1 - Job for date: 2014_09_23 failed with error: No FileSystem for scheme: gs
I know that 2 things need to be done in order for gs paths to be supported. One is install the GCS connector, and the other is have the following setup in core-site.xml of the Hadoop installation:
<property>
    <name>fs.gs.impl</name>
    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
    <description>The FileSystem for gs: (GCS) uris.</description>
</property>
<property>
    <name>fs.AbstractFileSystem.gs.impl</name>
    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
    <description>
     The AbstractFileSystem for gs: (GCS) uris. Only necessary for use with Hadoop 2.
    </description>
</property>
I think my problem comes from the fact I am not sure where exactly each piece need to be configured in this local mode. In the Intellij project, I am using Maven, and so I imported the spark library as follows:
<dependency> <!-- Spark dependency -->
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.2.0</version>
    <exclusions>
        <exclusion>  <!-- declare the exclusion here -->
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
        </exclusion>
    </exclusions>
</dependency>
, and Hadoop 1.2.1 as follows:
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>1.2.1</version>
</dependency>
The thing is, I am not sure where the hadoop location is configured for Spark, and also where the hadoop conf is configured. Therefore, I may be adding to the wrong Hadoop installation. In addition, is there something that needs to be restarted after modifying the files? As far as I saw, there is no Hadoop service running on my machine.
In Scala, add the following config when setting your hadoopConfiguration:
val conf = sc.hadoopConfiguration
conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
conf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
There are a couple ways to help Spark pick up the relevant Hadoop configurations, both involving modifying ${SPARK_INSTALL_DIR}/conf:
Copy or symlink your ${HADOOP_HOME}/conf/core-site.xml into ${SPARK_INSTALL_DIR}/conf/core-site.xml. For example, when bdutil installs onto a VM, it runs:
ln -s ${HADOOP_CONF_DIR}/core-site.xml ${SPARK_INSTALL_DIR}/conf/core-site.xml
Older Spark docs explain that this makes the xml files included in Spark's classpath automatically: https://spark.apache.org/docs/0.9.1/hadoop-third-party-distributions.html
Add an entry to ${SPARK_INSTALL_DIR}/conf/spark-env.sh with:
export HADOOP_CONF_DIR=/full/path/to/your/hadoop/conf/dir
Newer Spark docs seem to indicate this as the preferred method going forward: https://spark.apache.org/docs/1.1.0/hadoop-third-party-distributions.html
I can't say what's wrong, but here's what I would try.
fs.gs.project.id: <property><name>fs.gs.project.id</name><value>my-little-project</value></property>
sc.hadoopConfiguration.get(fs.gs.impl) to make sure your core-site.xml is getting loaded. Print it in the driver and also in the executor: println(x); rdd.foreachPartition { _ => println(x) }
sparkConf.setJars(...)). I don't think this would matter in local mode (it's all one JVM, right?) but you never know.Nothing but your program needs to be restarted. There is no Hadoop process. In local and standalone modes Spark only uses Hadoop as a library, and only for IO I think.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With