I have Spark and Hadoop installed on OS X. I successfully worked through an example where Hadoop ran locally, files were stored in HDFS, and I ran Spark with
spark-shell --master yarn-client
and from within the shell worked with HDFS. I'm having problems, however, trying to get Spark to run without HDFS, just locally on my machine. I looked at this answer, but messing around with environment variables doesn't feel right when the Spark documentation says
It’s easy to run locally on one machine — all you need is to have java installed on your system PATH, or the JAVA_HOME environment variable pointing to a Java installation.
If I run the basic SparkPi
example I get the correct output.
If I try to run the sample Java app they provide, I again get output, but this time with connection-refused errors relating to port 9000, which sounds like it's trying to connect to Hadoop. I don't know why, because I'm not specifying that anywhere:
$SPARK_HOME/bin/spark-submit --class "SimpleApp" --master local[4] ~/study/scala/sampleJavaApp/target/simple-project-1.0.jar
Exception in thread "main" java.net.ConnectException: Call From 37-2-37-10.tssg.org/10.37.2.37 to localhost:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
...
...
...
org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:604)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:699)
at org.apache.hadoop.ipc.Client$Connection.access(Client.java:367)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1462)
at org.apache.hadoop.ipc.Client.call(Client.java:1381)
... 51 more
15/07/31 11:05:06 INFO spark.SparkContext: Invoking stop() from shutdown hook
15/07/31 11:05:06 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/metrics/json,null}
...
...
...
15/07/31 11:05:06 INFO ui.SparkUI: Stopped Spark web UI at http://10.37.2.37:4040
15/07/31 11:05:06 INFO scheduler.DAGScheduler: Stopping DAGScheduler
15/07/31 11:05:06 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
15/07/31 11:05:06 INFO util.Utils: path = /private/var/folders/cg/vkq1ghks37lbflpdg0grq7f80000gn/T/spark-c6ba18f5-17a5-4da9-864c-509ec855cadf/blockmgr-b66cc31e-7371-472f-9886-4cd33d5ba4b1, already present as root for deletion.
15/07/31 11:05:06 INFO storage.MemoryStore: MemoryStore cleared
15/07/31 11:05:06 INFO storage.BlockManager: BlockManager stopped
15/07/31 11:05:06 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
15/07/31 11:05:06 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
15/07/31 11:05:06 INFO spark.SparkContext: Successfully stopped SparkContext
15/07/31 11:05:06 INFO util.Utils: Shutdown hook called
15/07/31 11:05:06 INFO util.Utils: Deleting directory /private/var/folders/cg/vkq1ghks37lbflpdg0grq7f80000gn/T/spark-c6ba18f5-17a5-4da9-864c-509ec855cadf
Any pointers/explanations as to where I'm going wrong would be much appreciated!
It seems that having the environment variable HADOOP_CONF_DIR set is causing some issues. Under that directory I have core-site.xml, which contains the following:
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
If I change the value, e.g. to <value>hdfs://localhost:9100</value>, then when I attempt to run the Spark job, the connection-refused error refers to this changed port:
Exception in thread "main" java.net.ConnectException: Call From 37-2-37-10.tssg.org/10.37.2.37 to localhost:9100 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
So for some reason, despite instructing it to run locally, it is trying to connect to HDFS. If I remove the HADOOP_CONF_DIR
environment variable, the job works fine.
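One quick way I could have confirmed this (a rough sketch only, reusing the JavaSparkContext named sc from my sample app, so treat the names as illustrative) is to print the default filesystem the Hadoop configuration actually resolves to:
// Sketch: assumes an existing JavaSparkContext named sc, as in the sample app.
// fs.defaultFS is the current name of the deprecated fs.default.name key.
// With HADOOP_CONF_DIR set, this should print hdfs://localhost:9000 (from core-site.xml);
// with it unset, it falls back to the local filesystem.
org.apache.hadoop.conf.Configuration hadoopConf = sc.hadoopConfiguration();
System.out.println(hadoopConf.get("fs.defaultFS", "file:///"));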
In cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.
YARN is a generic resource-management framework for distributed workloads; in other words, a cluster-level operating system. Although part of the Hadoop ecosystem, YARN can support a variety of compute frameworks (such as Tez and Spark) in addition to MapReduce.
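If the goal is simply to run without YARN (or HDFS) at all, one option is to pin the master in code rather than on the command line. The snippet below is only a sketch with made-up class and app names, not code from the question; note that a master set directly on SparkConf takes precedence over the --master flag passed to spark-submit.
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;

public class LocalOnlySketch {
    public static void main(String[] args) {
        // "local[4]" runs the driver and four worker threads in this JVM,
        // so no YARN resource manager is contacted at all.
        SparkConf conf = new SparkConf()
                .setAppName("LocalOnlySketch")
                .setMaster("local[4]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Trivial job to confirm the context works without a cluster.
        long n = sc.parallelize(Arrays.asList(1, 2, 3, 4)).count();
        System.out.println("count = " + n);

        sc.stop();
    }
}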
Apache Spark uses the Hadoop client libraries for file access when you use sc.textFile. This makes it possible to use an hdfs:// or s3n:// path, for example. You can also use local paths such as file:/home/robocode/foo.txt.
If you specify a file name without a scheme, fs.default.name is used. It defaults to file:, but you explicitly override it to hdfs://localhost:9000 in your core-site.xml. So if you don't specify the scheme, it's trying to read from HDFS.
The simplest solution is to specify the scheme:
JavaRDD<String> logData = sc.textFile("file:/home/robocode/foo.txt").cache();
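If you'd rather keep HADOOP_CONF_DIR set and still have scheme-less paths resolve to the local filesystem, you could override the default filesystem on the context's Hadoop configuration. This is just a sketch of the idea, not something from the original app; fs.defaultFS is the current name of fs.default.name, and the path is the same example file:
// Sketch: assumes the usual JavaSparkContext named sc.
// Override the default filesystem picked up from core-site.xml so that
// paths without a scheme resolve locally instead of hdfs://localhost:9000.
sc.hadoopConfiguration().set("fs.defaultFS", "file:///");
JavaRDD<String> logData = sc.textFile("/home/robocode/foo.txt").cache();
If I remember correctly, the same override can also be passed on the command line as --conf spark.hadoop.fs.defaultFS=file:///, since Spark copies spark.hadoop.* properties into the Hadoop Configuration.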
I had the same error: HADOOP_CONF_DIR was defined, so I just unset the environment variable:
unset HADOOP_CONF_DIR