I have Spark and Hadoop installed on OS X. I successfully worked through an example where Hadoop ran locally, files were stored in HDFS, and I ran Spark with
spark-shell --master yarn-client
and from within the shell worked with HDFS. I'm having problems, however, trying to get Spark to run without HDFS, just locally on my machine. I looked at this answer, but messing around with environment variables doesn't feel right when the Spark documentation says
It’s easy to run locally on one machine — all you need is to have java installed on your system PATH, or the JAVA_HOME environment variable pointing to a Java installation.
If I run the basic SparkPi
example I get the correct output.
If I try to run the sample Java app they provide, I again get output, but this time with connection-refused errors relating to port 9000, which sounds like it's trying to connect to Hadoop. I don't know why, because I'm not specifying that anywhere:
$SPARK_HOME/bin/spark-submit --class "SimpleApp" --master local[4] ~/study/scala/sampleJavaApp/target/simple-project-1.0.jar
Exception in thread "main" java.net.ConnectException: Call From 37-2-37-10.tssg.org/10.37.2.37 to localhost:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
...
...
...
org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:604)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:699)
at org.apache.hadoop.ipc.Client$Connection.access(Client.java:367)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1462)
at org.apache.hadoop.ipc.Client.call(Client.java:1381)
... 51 more
15/07/31 11:05:06 INFO spark.SparkContext: Invoking stop() from shutdown hook
15/07/31 11:05:06 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/metrics/json,null}
...
...
...
15/07/31 11:05:06 INFO ui.SparkUI: Stopped Spark web UI at http://10.37.2.37:4040
15/07/31 11:05:06 INFO scheduler.DAGScheduler: Stopping DAGScheduler
15/07/31 11:05:06 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
15/07/31 11:05:06 INFO util.Utils: path = /private/var/folders/cg/vkq1ghks37lbflpdg0grq7f80000gn/T/spark-c6ba18f5-17a5-4da9-864c-509ec855cadf/blockmgr-b66cc31e-7371-472f-9886-4cd33d5ba4b1, already present as root for deletion.
15/07/31 11:05:06 INFO storage.MemoryStore: MemoryStore cleared
15/07/31 11:05:06 INFO storage.BlockManager: BlockManager stopped
15/07/31 11:05:06 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
15/07/31 11:05:06 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
15/07/31 11:05:06 INFO spark.SparkContext: Successfully stopped SparkContext
15/07/31 11:05:06 INFO util.Utils: Shutdown hook called
15/07/31 11:05:06 INFO util.Utils: Deleting directory /private/var/folders/cg/vkq1ghks37lbflpdg0grq7f80000gn/T/spark-c6ba18f5-17a5-4da9-864c-509ec855cadf
Any pointers/explanations as to where I'm going wrong would be much appreciated!
It seems that having the environment variable HADOOP_CONF_DIR set is causing some issues. Under that directory I have core-site.xml, which contains the following:
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
If I change the value, e.g. to <value>hdfs://localhost:9100</value>, then when I attempt to run the Spark job, the connection-refused error refers to this changed port:
Exception in thread "main" java.net.ConnectException: Call From 37-2-37-10.tssg.org/10.37.2.37 to localhost:9100 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
So for some reason, despite instructing it to run locally, it is trying to connect to HDFS. If I remove the HADOOP_CONF_DIR
environment variable, the job works fine.
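One quick way I could have confirmed this (a rough sketch only, reusing the JavaSparkContext named sc from my sample app, so treat the names as illustrative) is to print the default filesystem the Hadoop configuration actually resolves to:
// Sketch: assumes an existing JavaSparkContext named sc, as in the sample app.
// fs.defaultFS is the current name of the deprecated fs.default.name key.
// With HADOOP_CONF_DIR set, this should print hdfs://localhost:9000 (from core-site.xml);
// with it unset, it falls back to the local filesystem.
org.apache.hadoop.conf.Configuration hadoopConf = sc.hadoopConfiguration();
System.out.println(hadoopConf.get("fs.defaultFS", "file:///"));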
In cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.
YARN is a generic resource-management framework for distributed workloads; in other words, a cluster-level operating system. Although part of the Hadoop ecosystem, YARN can support a variety of compute frameworks (such as Tez and Spark) in addition to MapReduce.
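If the goal is simply to run without YARN (or HDFS) at all, one option is to pin the master in code rather than on the command line. The snippet below is only a sketch with made-up class and app names, not code from the question; note that a master set directly on SparkConf takes precedence over the --master flag passed to spark-submit.
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;

public class LocalOnlySketch {
    public static void main(String[] args) {
        // "local[4]" runs the driver and four worker threads in this JVM,
        // so no YARN resource manager is contacted at all.
        SparkConf conf = new SparkConf()
                .setAppName("LocalOnlySketch")
                .setMaster("local[4]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Trivial job to confirm the context works without a cluster.
        long n = sc.parallelize(Arrays.asList(1, 2, 3, 4)).count();
        System.out.println("count = " + n);

        sc.stop();
    }
}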
Apache Spark uses the Hadoop client libraries for file access when you use sc.textFile. This makes it possible to use an hdfs:// or s3n:// path, for example. You can also use local paths such as file:/home/robocode/foo.txt.
If you specify a file name without a scheme, fs.default.name is used. It defaults to file:, but you explicitly override it to hdfs://localhost:9000 in your core-site.xml. So if you don't specify the scheme, it's trying to read from HDFS.
The simplest solution is to specify the scheme:
JavaRDD<String> logData = sc.textFile("file:/home/robocode/foo.txt").cache();
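If you'd rather keep HADOOP_CONF_DIR set and still have scheme-less paths resolve to the local filesystem, you could override the default filesystem on the context's Hadoop configuration. This is just a sketch of the idea, not something from the original app; fs.defaultFS is the current name of fs.default.name, and the path is the same example file:
// Sketch: assumes the usual JavaSparkContext named sc.
// Override the default filesystem picked up from core-site.xml so that
// paths without a scheme resolve locally instead of hdfs://localhost:9000.
sc.hadoopConfiguration().set("fs.defaultFS", "file:///");
JavaRDD<String> logData = sc.textFile("/home/robocode/foo.txt").cache();
If I remember correctly, the same override can also be passed on the command line as --conf spark.hadoop.fs.defaultFS=file:///, since Spark copies spark.hadoop.* properties into the Hadoop Configuration.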
I had the same error: HADOOP_CONF_DIR was defined, so I just unset the environment variable:
unset HADOOP_CONF_DIR