Docker service stuck in New state (Swarm)

I'm facing a strange issue with my Docker Swarm (a cluster of 3 managers and 5 workers). I have many services running, and when I approach around 100 services (more than 110 once replicas are counted), the new services I want to run won't start.

When I list the tasks of such a service, I get this:

ID            NAME            IMAGE       NODE  DESIRED STATE  CURRENT STATE     ERROR  PORTS
alam7whfn1xe  service_name.1  some_image        Running        New 22 hours ago

You can see CURRENT STATE == New 22 hours ago. If I try to inspect the logs, they're empty. Inspecting the service doesn't help either (nothing relevant).

If I stop some of my services, the service stuck in the New state may start by itself after the first retry. It seems that I have hit some kind of limit.

I went through some documentation on the web and found nothing clear about this issue. Any links that could help are welcome.

My current suspicion is that the networks I created in the Swarm (--driver=overlay) have too small an IP range and can't hand out enough IP addresses to containers. These networks are /24 subnets. Is there any way to "flush" the IP reservations in order to re-initialize the networks without recreating the Docker networks?

After investigation, two types of services can end up in this New state, and they are attached to the same two networks.

The result of docker network inspect:

[
    {
        "Name": "network_name",
        "Id": "okbrl5twyheq32ht3zw5l00gs",
        "Created": "0001-01-01T00:00:00Z", <- this is the real date, strange isn't it?
        "Scope": "swarm",
        "Driver": "overlay",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": null,
            "Config": [
                {
                    "Subnet": "172.16.2.0/24",
                    "Gateway": "172.16.2.1"
                }
            ]
        },
        "Internal": false,
        "Attachable": true,
        "Ingress": false,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": null,
        "Options": {
            "com.docker.network.driver.overlay.vxlanid_list": "4097"
        },
        "Labels": null
    }
]
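To put a number on the /24 subnet reported above, here is a minimal Python sketch that reads a trimmed copy of this inspect output and counts the addresses left for endpoints. The function name usable_addresses is made up for illustration, and the accounting (network address, broadcast address and gateway are not available) is an assumption rather than a statement of how Docker's IPAM driver books addresses.

```python
import ipaddress
import json

# Trimmed copy of the `docker network inspect` output shown above.
INSPECT_OUTPUT = """
[
    {
        "Name": "network_name",
        "Driver": "overlay",
        "IPAM": {
            "Config": [
                {"Subnet": "172.16.2.0/24", "Gateway": "172.16.2.1"}
            ]
        }
    }
]
"""


def usable_addresses(inspect_json: str) -> dict:
    """Map each network name to the number of addresses left for endpoints.

    Assumption: the network and broadcast addresses are unusable, and the
    gateway takes one more address out of each subnet.
    """
    result = {}
    for network in json.loads(inspect_json):
        total = 0
        for config in network["IPAM"]["Config"]:
            subnet = ipaddress.ip_network(config["Subnet"])
            hosts = subnet.num_addresses - 2  # drop network + broadcast
            if "Gateway" in config:
                hosts -= 1  # the gateway address is reserved
            total += hosts
        result[network["Name"]] = total
    return result


print(usable_addresses(INSPECT_OUTPUT))  # {'network_name': 253}
```

Under these assumptions the network has 253 addresses to share between everything attached to it.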

Additionally, the output of docker version:

Client:
 Version:      17.06.2-ce
 API version:  1.30
 Go version:   go1.8.3
 Git commit:   cec0b72
 Built:        Tue Sep  5 20:00:06 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.06.2-ce
 API version:  1.30 (minimum version 1.12)
 Go version:   go1.8.3
 Git commit:   cec0b72
 Built:        Tue Sep  5 19:58:57 2017
 OS/Arch:      linux/amd64
 Experimental: false

N.B.: I don't want to upgrade Docker at the moment.

EDIT 1:

I reread the Docker documentation about networks, and it mentions an open issue on Moby's GitHub project: Swarm Mode at Scale #30820.

Overlay network limitations

You should create overlay networks with /24 blocks (the default), which limits you to 256 IP addresses, when you create networks using the default VIP-based endpoint-mode. This recommendation addresses limitations with swarm mode. If you need more than 256 IP addresses, do not increase the IP block size. You can either use dnsrr endpoint mode with an external load balancer, or use multiple smaller overlay networks. See Configure service discovery for more information about different endpoint modes.

-- https://docs.docker.com/engine/reference/commandline/network_create/#overlay-network-limitations

EDIT 2:

Based on Flavio 'fcrisciani' Crisciani's comment on the issue Swarm Mode at Scale #30820, I'll try adding the option --endpoint-mode=dnsrr to my services.

Paul Rey asked Jun 28 '18 09:06


4 Answers

Each service and each task gets an IP address, so the overlay network that the services are attached to needs a subnet large enough to supply all of those addresses.

Use the following command to create a Docker network with a larger range of available IPs:

docker network create --driver=overlay --subnet=10.10.0.0/16 <network_name>

Reference: https://github.com/docker/for-aws/issues/104#issuecomment-331563445 https://docs.docker.com/engine/reference/commandline/network_create/
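To compare the two subnet sizes discussed here, the following Python sketch counts host addresses in a /24 and a /16 and estimates how many services would fit. It is only a rough model: the helper names are hypothetical, reserving one address for the gateway is an assumption, and the "one virtual IP plus one IP per task" cost follows the description at the top of this answer rather than a measurement.

```python
import ipaddress


def usable_hosts(cidr: str) -> int:
    """Host addresses in a subnet, minus one for the gateway (assumption)."""
    network = ipaddress.ip_network(cidr)
    # num_addresses includes the network and broadcast addresses.
    return network.num_addresses - 2 - 1


def max_services(cidr: str, replicas: int) -> int:
    """Services that fit if each costs one virtual IP plus one IP per task."""
    return usable_hosts(cidr) // (1 + replicas)


for cidr in ("172.16.2.0/24", "10.10.0.0/16"):
    print(cidr, usable_hosts(cidr), max_services(cidr, replicas=1))
# 172.16.2.0/24 253 126
# 10.10.0.0/16 65533 32766
```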

Pratik answered Oct 23 '22 06:10


Setting the option --endpoint-mode=dnsrr on every service seems to solve this issue.
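A plausible reason this helps: with dnsrr, a service is resolved through DNS round robin and no virtual IP is allocated for it, so only its tasks consume addresses. The Python sketch below illustrates that difference. The function addresses_needed is hypothetical, and the figures (100 services, 110 tasks) are the rough numbers from the question, assumed here to all sit on one network.

```python
def addresses_needed(services: int, tasks: int, endpoint_mode: str = "vip") -> int:
    """Estimate overlay addresses consumed on a single network.

    Assumption: in "vip" mode every service holds one virtual IP in addition
    to one IP per task; in "dnsrr" mode only the tasks hold addresses.
    """
    if endpoint_mode == "vip":
        return services + tasks
    if endpoint_mode == "dnsrr":
        return tasks
    raise ValueError(f"unknown endpoint mode: {endpoint_mode!r}")


# Rough figures from the question: about 100 services and 110 tasks.
print(addresses_needed(100, 110, "vip"))    # 210
print(addresses_needed(100, 110, "dnsrr"))  # 110
```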

Paul Rey answered Oct 23 '22 07:10


This looks like a resource limit: the overlay network does not have enough IP addresses for all of these swarm tasks. You could create a Docker network with a larger subnet, such as 10.10.0.0/16, and then reference it in your compose file when creating the service. I think this could resolve the problem.

dodo Hsu answered Oct 23 '22 06:10


This is typically caused by running out of IP addresses. You can increase the number of available addresses by running:

docker swarm init --default-addr-pool-mask-length 16 --force-new-cluster

This command keeps all your existing services running, but it is of course a good idea to make a backup first:

https://docs.docker.com/engine/swarm/admin_guide/#Back%20up%20the%20swarm
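To see what the mask length changes, this Python sketch splits an address pool into subnets of a given prefix length and reports how many subnets result and how large each one is. The pool 10.0.0.0/8 is assumed here as the default address pool (check your own swarm's settings), and subnets_in_pool is a made-up helper name.

```python
import ipaddress


def subnets_in_pool(pool: str, mask_length: int) -> tuple:
    """Return (number of subnets, addresses per subnet) for a pool split."""
    pool_net = ipaddress.ip_network(pool)
    if not pool_net.prefixlen <= mask_length <= pool_net.max_prefixlen:
        raise ValueError("mask length must lie within the pool's prefix range")
    count = 2 ** (mask_length - pool_net.prefixlen)
    size = 2 ** (pool_net.max_prefixlen - mask_length)
    return count, size


# Assumed default pool; a mask length of 24 versus the 16 used above.
print(subnets_in_pool("10.0.0.0/8", 24))  # (65536, 256)
print(subnets_in_pool("10.0.0.0/8", 16))  # (256, 65536)
```

The trade-off is visible in the numbers: a shorter mask gives each network far more addresses but leaves room for fewer networks in the pool.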

PHZ.fi-Pharazon answered Oct 23 '22 06:10