Apache Spark Running Locally Giving Refused Connection Error

I have Spark and Hadoop installed on OS X. I successfully worked through an example where Hadoop ran locally, had files stored in HDFS and I ran spark with

spark-shell --master yarn-client

and from within the shell worked with HDFS. I'm having problems, however, trying to get Spark to run without HDFS, just locally on my machine. I looked at this answer, but it doesn't feel right to mess around with environment variables when the Spark documentation says:

It’s easy to run locally on one machine — all you need is to have java installed on your system PATH, or the JAVA_HOME environment variable pointing to a Java installation.

If I run the basic SparkPi example I get the correct output.

If I try to run the sample Java app they provide, I again get output, but this time with connection refused errors relating to port 9000, which sounds like it's trying to connect to Hadoop. I don't know why, because I'm not specifying that anywhere (a sketch of the app follows the log below):

    $SPARK_HOME/bin/spark-submit --class "SimpleApp" --master local[4] ~/study/scala/sampleJavaApp/target/simple-project-1.0.jar
    Exception in thread "main" java.net.ConnectException: Call From 37-2-37-10.tssg.org/10.37.2.37 to localhost:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
...
...
...
org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:604)
        at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:699)
        at org.apache.hadoop.ipc.Client$Connection.access(Client.java:367)
        at org.apache.hadoop.ipc.Client.getConnection(Client.java:1462)
        at org.apache.hadoop.ipc.Client.call(Client.java:1381)
        ... 51 more
    15/07/31 11:05:06 INFO spark.SparkContext: Invoking stop() from shutdown hook
    15/07/31 11:05:06 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/metrics/json,null}
...
...
...
    15/07/31 11:05:06 INFO ui.SparkUI: Stopped Spark web UI at http://10.37.2.37:4040
    15/07/31 11:05:06 INFO scheduler.DAGScheduler: Stopping DAGScheduler
    15/07/31 11:05:06 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
    15/07/31 11:05:06 INFO util.Utils: path = /private/var/folders/cg/vkq1ghks37lbflpdg0grq7f80000gn/T/spark-c6ba18f5-17a5-4da9-864c-509ec855cadf/blockmgr-b66cc31e-7371-472f-9886-4cd33d5ba4b1, already present as root for deletion.
    15/07/31 11:05:06 INFO storage.MemoryStore: MemoryStore cleared
    15/07/31 11:05:06 INFO storage.BlockManager: BlockManager stopped
    15/07/31 11:05:06 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
    15/07/31 11:05:06 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
    15/07/31 11:05:06 INFO spark.SparkContext: Successfully stopped SparkContext
    15/07/31 11:05:06 INFO util.Utils: Shutdown hook called
    15/07/31 11:05:06 INFO util.Utils: Deleting directory /private/var/folders/cg/vkq1ghks37lbflpdg0grq7f80000gn/T/spark-c6ba18f5-17a5-4da9-864c-509ec855cadf
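
For context, the sample app is essentially the Spark quick-start SimpleApp. The sketch below is my reconstruction rather than the exact contents of the jar (the file path and class layout are placeholders); the detail that matters is that the input path carries no scheme:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class SimpleApp {
        public static void main(String[] args) {
            // The input path has no scheme (no file: or hdfs: prefix), so the
            // Hadoop client decides which filesystem to use.
            String logFile = "/Users/me/spark/README.md"; // hypothetical local path
            SparkConf conf = new SparkConf().setAppName("Simple Application");
            JavaSparkContext sc = new JavaSparkContext(conf);
            JavaRDD<String> logData = sc.textFile(logFile).cache();
            long numAs = logData.filter(s -> s.contains("a")).count();
            long numBs = logData.filter(s -> s.contains("b")).count();
            System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);
            sc.stop();
        }
    }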

Any pointers/explanations as to where I'm going wrong would be much appreciated!


UPDATE

It seems that the fact I have the environment variable HADOOP_CONF_DIR set is causing some issues. Under that directory, I have core-site.xml which contains the following

<property>
     <name>fs.default.name</name>                                     
     <value>hdfs://localhost:9000</value>                             
</property> 

If I change the value, e.g. to <value>hdfs://localhost:9100</value>, then when I attempt to run the Spark job, the connection refused error refers to this changed port:

Exception in thread "main" java.net.ConnectException: Call From 37-2-37-10.tssg.org/10.37.2.37 to localhost:9100 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused 

So for some reason, despite instructing it to run locally, it is trying to connect to HDFS. If I remove the HADOOP_CONF_DIR environment variable, the job works fine.
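
Presumably the inherited default could also be overridden programmatically rather than by unsetting the variable (fs.defaultFS is the current name for the deprecated fs.default.name key). This is only a sketch of the idea, not something I have verified; the class name and path are placeholders:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class LocalFsApp {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("Simple Application");
            JavaSparkContext sc = new JavaSparkContext(conf);
            // Point the default filesystem back at the local FS, overriding the
            // hdfs://localhost:9000 value inherited from $HADOOP_CONF_DIR/core-site.xml.
            sc.hadoopConfiguration().set("fs.defaultFS", "file:///");
            // An unqualified path should now resolve against the local filesystem.
            JavaRDD<String> logData = sc.textFile("/tmp/foo.txt"); // hypothetical path
            System.out.println("Lines: " + logData.count());
            sc.stop();
        }
    }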

asked Jul 31 '15 by Philip O'Brien




2 Answers

Apache Spark uses the Hadoop client libraries for file access when you use sc.textFile. This makes it possible to use an hdfs:// or s3n:// path, for example. You can also use local paths, such as file:/home/robocode/foo.txt.

If you specify a file name without a scheme, fs.default.name is used. It defaults to file:, but you explicitly override it to hdfs://localhost:9000 in your core-site.xml, so if you don't specify the scheme, Spark tries to read from HDFS.

The simplest solution is to specify the scheme:

JavaRDD<String> logData = sc.textFile("file:/home/robocode/foo.txt").cache();
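
With the scheme made explicit, the read no longer goes through fs.default.name, so the same spark-submit command should work whether or not HADOOP_CONF_DIR is set.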
answered Sep 30 '22 by Daniel Darabos


I had the same error: HADOOP_CONF_DIR was defined, so I just unset the environment variable.

unset HADOOP_CONF_DIR
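
Note that unset only affects the current shell session; if the variable is exported from your shell profile (e.g. ~/.bash_profile), it will reappear in new shells, so remove that export to make the change permanent.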
answered Sep 30 '22 by Germán