How to make HDFS work in docker swarm

I'm having trouble getting my HDFS setup to work in docker swarm. To understand the problem, I've reduced my setup to the minimum:

  • 1 physical machine
  • 1 namenode
  • 1 datanode

This setup works fine with docker-compose, but it fails with docker swarm, using the same compose file.

Here is the compose file:

version: '3'
services:
  namenode:
    image: uhopper/hadoop-namenode
    hostname: namenode
    ports:
      - "50070:50070"
      - "8020:8020"
    volumes:
      - /userdata/namenode:/hadoop/dfs/name
    environment:
      - CLUSTER_NAME=hadoop-cluster

  datanode:
    image: uhopper/hadoop-datanode
    depends_on:
      - namenode
    volumes:
      - /userdata/datanode:/hadoop/dfs/data
    environment:
      - CORE_CONF_fs_defaultFS=hdfs://namenode:8020

To test it, I have installed a Hadoop client on my host (physical) machine, with only this simple configuration in core-site.xml:

<configuration>
  <property><name>fs.defaultFS</name><value>hdfs://0.0.0.0:8020</value></property>
</configuration>

Then I run the following command:

hdfs dfs -put test.txt /test.txt

With docker-compose (just running docker-compose up) it works, and the file is written to HDFS.

With docker swarm, I'm running:

docker swarm init 
docker stack deploy --compose-file docker-compose.yml hadoop

Then, when all services are up and I put my file on HDFS, it fails like this:

INFO hdfs.DataStreamer: Exception in createBlockOutputStream
org.apache.hadoop.net.ConnectTimeoutException: 60000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=/x.x.x.x:50010]
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:534)
        at org.apache.hadoop.hdfs.DataStreamer.createSocketForPipeline(DataStreamer.java:259)
        at org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1692)
        at org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1648)
        at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:704)
18/06/14 17:29:41 WARN hdfs.DataStreamer: Abandoning BP-1801474405-10.0.0.4-1528990089179:blk_1073741825_1001
18/06/14 17:29:41 WARN hdfs.DataStreamer: Excluding datanode DatanodeInfoWithStorage[10.0.0.6:50010,DS-d7d71735-7099-4aa9-8394-c9eccc325806,DISK]
18/06/14 17:29:41 WARN hdfs.DataStreamer: DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /test.txt._COPYING_ could only be replicated to 0 nodes instead of minReplication (=1).  There are 1 datanode(s) running and 1 node(s) are excluded in this operation.

If I look in the web UI, the datanode seems to be up and no issue is reported...

Update: it seems that depends_on is ignored by swarm, but that does not seem to be the cause of my problem: I've restarted the datanode once the namenode was up, but it did not work any better.
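
(For anyone reproducing this: the service state can be checked and the datanode force-restarted with commands like the ones below; the hadoop_ prefix assumes the stack name hadoop used above.)

docker service ls
docker service logs hadoop_datanode
docker service update --force hadoop_datanode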

Thanks for your help :)

Asked Jun 14 '18 by Loic



1 Answer

The whole mess stems from the interaction between docker swarm's overlay networks and the way the HDFS namenode keeps track of its datanodes. The namenode registers each datanode under its overlay-network IP/hostname. When the HDFS client asks to read from or write to the datanodes directly, the namenode reports back those overlay-network addresses. Since the overlay network is not reachable from external clients, any read/write operation fails.
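
One way to see this from the client host (not in the original post, just an illustration) is to ask the namenode for its datanode report; the address it advertises is the overlay IP, which cannot be reached from outside the swarm:

hdfs dfsadmin -report
# Illustrative output (abridged): the datanode is advertised under its
# overlay address, which is only reachable from inside the swarm network.
#   Live datanodes (1):
#   Name: 10.0.0.6:50010
#   ...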

The final solution I used (after a lot of struggling to get the overlay network to work) was to have the HDFS services use the host network. Here's a snippet from the compose file:

version: '3.7'

x-deploy_default: &deploy_default
  mode: replicated
  replicas: 1
  placement:
    constraints:
      - node.role == manager
  restart_policy:
    condition: any
    delay: 5s

services:
  hdfs_namenode:
    deploy:
      <<: *deploy_default
    networks:
      hostnet: {}
    volumes:
      - hdfs_namenode:/hadoop-3.2.0/var/name_node
    command:
      namenode -fs hdfs://${PRIMARY_HOST}:9000
    image: hadoop:3.2.0

  hdfs_datanode:
    deploy:
      mode: global
    networks:
      hostnet: {}
    volumes:
      - hdfs_datanode:/hadoop-3.2.0/var/data_node
    command:
      datanode -fs hdfs://${PRIMARY_HOST}:9000
    image: hadoop:3.2.0
volumes:
  hdfs_namenode:
  hdfs_datanode:

networks:
  hostnet:
    external: true
    name: host
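
For reference, a minimal way to deploy this sketch (assuming the file above is saved as docker-compose.yml, an arbitrary stack name of hdfs, and that PRIMARY_HOST is exported in the shell so the compose interpolation can pick it up):

export PRIMARY_HOST=$(hostname -f)
docker stack deploy --compose-file docker-compose.yml hdfs

With everything on the host network, the datanodes register with the host's real address, so an external client can point fs.defaultFS at hdfs://<that host>:9000 and reach both the namenode and the datanodes directly.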
Answered Oct 02 '22 by ftzeng12