
"No Filesystem for Scheme: gs" when running spark job locally

I am running a Spark job (version 1.2.0), and the input is a folder inside a Google Cloud Storage bucket (i.e. gs://mybucket/folder).

When running the job locally on my Mac machine, I am getting the following error:

5932 [main] ERROR com.doit.customer.dataconverter.Phase1 - Job for date: 2014_09_23 failed with error: No FileSystem for scheme: gs

I know that two things need to be done in order for gs:// paths to be supported. One is to install the GCS connector, and the other is to add the following setup to the core-site.xml of the Hadoop installation:

<property>
    <name>fs.gs.impl</name>
    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
    <description>The FileSystem for gs: (GCS) uris.</description>
</property>
<property>
    <name>fs.AbstractFileSystem.gs.impl</name>
    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
    <description>
     The AbstractFileSystem for gs: (GCS) uris. Only necessary for use with Hadoop 2.
    </description>
</property>

I think my problem comes from the fact that I am not sure where exactly each piece needs to be configured in this local mode. In the IntelliJ project I am using Maven, so I imported the Spark library as follows:

<dependency> <!-- Spark dependency -->
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.2.0</version>
    <exclusions>
        <exclusion>  <!-- declare the exclusion here -->
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
        </exclusion>
    </exclusions>
</dependency>

and Hadoop 1.2.1 as follows:

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>1.2.1</version>
</dependency>
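
Presumably the GCS connector itself also has to be on this same local classpath. One way would be to pull it in through Maven as well; this is only a sketch, and the version string below is a placeholder (the exact coordinates/version for a Hadoop 1.x build would need to be checked on Maven Central):

<dependency> <!-- GCS connector; version is a placeholder for a Hadoop 1.x build -->
    <groupId>com.google.cloud.bigdataoss</groupId>
    <artifactId>gcs-connector</artifactId>
    <version>hadoop1-1.9.17</version>
</dependency>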

The thing is, I am not sure where the Hadoop location is configured for Spark, and also where the Hadoop conf is configured. Therefore, I may be adding to the wrong Hadoop installation. In addition, is there something that needs to be restarted after modifying the files? As far as I can see, there is no Hadoop service running on my machine.

Asked Jan 05 '15 by Yaniv Donenfeld


3 Answers

In Scala, add the following configuration to your hadoopConfiguration:

val conf = sc.hadoopConfiguration
conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
conf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
Answered by Art Haedike


There are a couple ways to help Spark pick up the relevant Hadoop configurations, both involving modifying ${SPARK_INSTALL_DIR}/conf:

  1. Copy or symlink your ${HADOOP_HOME}/conf/core-site.xml into ${SPARK_INSTALL_DIR}/conf/core-site.xml. For example, when bdutil installs onto a VM, it runs:

    ln -s ${HADOOP_CONF_DIR}/core-site.xml ${SPARK_INSTALL_DIR}/conf/core-site.xml
    

Older Spark docs explain that this automatically includes the XML files on Spark's classpath: https://spark.apache.org/docs/0.9.1/hadoop-third-party-distributions.html

  2. Add an entry to ${SPARK_INSTALL_DIR}/conf/spark-env.sh with:

    export HADOOP_CONF_DIR=/full/path/to/your/hadoop/conf/dir
    

Newer Spark docs seem to indicate that this is the preferred method going forward: https://spark.apache.org/docs/1.1.0/hadoop-third-party-distributions.html
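
When the job is launched from an IDE rather than through spark-submit, a related option (my assumption, not something these docs spell out) is to put core-site.xml on the application classpath, e.g. under src/main/resources, since Hadoop's Configuration loads core-site.xml from the classpath. A quick way to check:

import org.apache.hadoop.conf.Configuration

// If core-site.xml is on the classpath, a plain Hadoop Configuration
// should already see the gs settings, independently of Spark:
val hadoopConf = new Configuration()
println(hadoopConf.get("fs.gs.impl")) // expect com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem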

Answered by Dennis Huo


I can't say what's wrong, but here's what I would try.

  • Try setting fs.gs.project.id: <property><name>fs.gs.project.id</name><value>my-little-project</value></property>
  • Print sc.hadoopConfiguration.get("fs.gs.impl") to make sure your core-site.xml is getting loaded. Print it in the driver and also on the executors (see the sketch after this list).
  • Make sure the GCS jar is sent to the executors (sparkConf.setJars(...)). I don't think this would matter in local mode (it's all one JVM, right?), but you never know.
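
Here is a rough sketch of the second check (driver vs. executors); reading a fresh Configuration() inside the closure shows what that JVM's classpath provides (the parallelize range is arbitrary):

import org.apache.hadoop.conf.Configuration

// Driver-side view of the property:
println("driver fs.gs.impl = " + sc.hadoopConfiguration.get("fs.gs.impl"))

// Executor-side view: a new Configuration() picks up whatever core-site.xml
// is on that executor's classpath (in local mode it is the same JVM anyway).
sc.parallelize(1 to 4).foreachPartition { _ =>
  println("executor fs.gs.impl = " + new Configuration().get("fs.gs.impl"))
}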

Nothing but your program needs to be restarted. There is no Hadoop process. In local and standalone modes Spark only uses Hadoop as a library, and only for IO I think.

Answered by Daniel Darabos