I have a Docker image for Spark + Jupyter (https://github.com/zipfian/spark-install) and another Docker image for Hadoop (https://github.com/kiwenlau/hadoop-cluster-docker).
I am running two containers from these two images on Ubuntu. In the first container, I am able to launch Jupyter and run Python code:
import pyspark
sc = pyspark.SparkContext('local[*]')
rdd = sc.parallelize(range(1000))
rdd.takeSample(False,5)
For the second container:
In the host Ubuntu OS, I am able to successfully go to the
Now I want to write to the HDFS file system (running in the 2nd container) from jupyter (running in the first container).
So I add the additional line
rdd.saveAsTextFile("hdfs:///user/root/input/test")
I get the error:
HDFS URI, no host: hdfs:///user/root/input/test
Am I giving the HDFS path incorrectly?
My understanding is that I should be able to talk to a Docker container running HDFS from another container running Spark. Am I missing anything?
Thanks for your time.
I haven't tried docker compose yet.
The URI hdfs:///user/root/input/test is missing an authority (hostname) section and port. To write to HDFS in another container you would need to fully specify the URI and make sure the two containers are on the same network and that the HDFS container has the ports for the namenode and data node exposed.
For example, you might have set the host name for the HDFS container to be hdfs.container. Then you can write to that HDFS instance using the URI hdfs://hdfs.container:8020/user/root/input/test (assuming the namenode is running on 8020). Of course you will also need to make sure that the path you're seeking to write has the correct permissions as well.
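To make the "no host" complaint concrete, here is a minimal sketch that uses Python's standard urllib.parse to show which parts of each URI are present. The helper name describe_hdfs_uri is made up for illustration, and hdfs.container and 8020 are the example host name and port from above, not values taken from your setup.

```python
from urllib.parse import urlparse


def describe_hdfs_uri(uri):
    """Return (hostname, port, path) as parsed from an HDFS-style URI."""
    parts = urlparse(uri)
    return parts.hostname, parts.port, parts.path


# The URI from the question: the authority section between "//" and the
# path is empty, so there is no host and no port.
print(describe_hdfs_uri("hdfs:///user/root/input/test"))
# → (None, None, '/user/root/input/test')

# A fully specified URI (host name and port are assumptions, see above).
print(describe_hdfs_uri("hdfs://hdfs.container:8020/user/root/input/test"))
# → ('hdfs.container', 8020, '/user/root/input/test')
```

The second form is the one to pass to rdd.saveAsTextFile, with the host name and port replaced by whatever your HDFS container actually uses.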
So to do what you want:
1. Make sure the HDFS container exposes the namenode and data node ports. You can do this with an EXPOSE directive in the dockerfile (the container you linked does not have these) or using the --expose argument when invoking docker run. The default ports are 8020 and 50010 (for NN and DN respectively).
2. Start the containers on the same network. If you just docker run with no --network they will start on the default network and you'll be fine. Start the HDFS container with a specific name using the --name argument.
3. Modify your URI to include the proper authority (the value of the --name argument you passed) and port as described above, and it should work.
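Once the containers are up, it can help to confirm from the Jupyter container that the namenode port is actually reachable before blaming Spark. The following is only a sketch under assumptions: namenode_reachable and save_to_hdfs are hypothetical helper names, and the host name and port in the usage comment are the example values from this answer, not something read from your containers.

```python
import socket


def namenode_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable host names.
        return False


def save_to_hdfs(rdd, host, port, path):
    """Write an RDD to HDFS, failing early if the namenode is unreachable."""
    if not namenode_reachable(host, port):
        raise ConnectionError(f"cannot reach namenode at {host}:{port}")
    # Fully specified URI: scheme, authority (host and port), then the path.
    rdd.saveAsTextFile(f"hdfs://{host}:{port}{path}")


# Usage from the notebook (host name and port are assumptions):
# save_to_hdfs(rdd, "hdfs.container", 8020, "/user/root/input/test")
```

If namenode_reachable returns False, the problem is in the network or port setup from the steps above rather than in the URI.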