 

Write to HDFS running in Docker from another Docker container running Spark


I have a Docker image for Spark + Jupyter (https://github.com/zipfian/spark-install).

I have another Docker image for Hadoop (https://github.com/kiwenlau/hadoop-cluster-docker).

I am running two containers from the above two images on Ubuntu. In the first container I can successfully launch Jupyter and run this Python code:

    import pyspark
    sc = pyspark.SparkContext('local[*]')
    rdd = sc.parallelize(range(1000))
    rdd.takeSample(False, 5)

For the second container:

From the host Ubuntu OS, in a web browser I can successfully reach:

  • localhost:8088: the Hadoop "All Applications" page
  • localhost:50070: the HDFS file system browser


Now I want to write to the HDFS file system (running in the second container) from Jupyter (running in the first container).

So I add this additional line:

rdd.saveAsTextFile("hdfs:///user/root/input/test")

I get the error:

HDFS URI, no host: hdfs:///user/root/input/test

Am I giving the HDFS path incorrectly?

My understanding is that I should be able to talk to a Docker container running HDFS from another container running Spark. Am I missing anything?

Thanks for your time.

I haven't tried Docker Compose yet.

VenVig asked Oct 06 '17


1 Answer

The URI hdfs:///user/root/input/test is missing the authority (hostname) section and the port. To write to HDFS in another container you need to fully specify the URI, make sure the two containers are on the same network, and make sure the HDFS container has the NameNode and DataNode ports exposed.

For example, you might have set the hostname of the HDFS container to hdfs.container. Then you can write to that HDFS instance using the URI hdfs://hdfs.container:8020/user/root/input/test (assuming the NameNode is listening on port 8020). You will also need to make sure that the path you are writing to has the correct permissions.
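As a minimal sketch of the corrected write from the question's notebook (hdfs.container and port 8020 are assumptions here; substitute whatever name and NameNode port your HDFS container actually uses):

    import pyspark

    sc = pyspark.SparkContext('local[*]')
    rdd = sc.parallelize(range(1000))

    # Fully qualified URI: scheme://authority:port/path
    # "hdfs.container" and 8020 are placeholders for your container's
    # name on the Docker network and its NameNode port.
    rdd.saveAsTextFile("hdfs://hdfs.container:8020/user/root/input/test")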

So to do what you want:

  • Make sure your HDFS container has the NameNode and DataNode ports exposed. You can do this with an EXPOSE directive in the Dockerfile (the container you linked does not have these) or with the --expose argument when invoking docker run. The default ports are 8020 and 50010 (for the NameNode and DataNode respectively).
  • Start the containers on the same network. If you just do docker run with no --network they will start on the default network and you'll be fine. Start the HDFS container with a specific name using the --name argument.
  • Now modify your URI to include the proper authority (this will be the value of the docker --name argument you passed) and port as described above, and it should work (see the sketch after this list).
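A rough sketch of the corresponding docker commands, with placeholder image and network names (the exact image names, and any additional ports your images need, are assumptions to adapt to your setup):

    # A user-defined bridge network gives containers DNS resolution by name.
    docker network create spark-hdfs

    # HDFS container: fixed name so Spark can resolve it, with the default
    # NameNode (8020) and DataNode (50010) ports exposed.
    # "your-hdfs-image" is a placeholder.
    docker run -d --name hdfs.container --network spark-hdfs \
        --expose 8020 --expose 50010 your-hdfs-image

    # Spark + Jupyter container on the same network.
    # "your-spark-jupyter-image" is a placeholder; 8888 is the usual Jupyter port.
    docker run -d --name spark.container --network spark-hdfs \
        -p 8888:8888 your-spark-jupyter-image

With both containers on the same network, the URI hdfs://hdfs.container:8020/user/root/input/test should then resolve from the Spark container.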
Ed Kohlwey answered Oct 11 '22