
docker swarm not restarting unhealthy selenium hub containers

I have selenium grid deployed with docker swarm.


version: '3.7'

services:
  hub:
    image: selenium/hub:3.141.59-mercury
    ports:
      - "4444:4444"
    volumes:
      - /dev/shm:/dev/shm
    privileged: true
    environment:
      HUB_HOST: hub
      HUB_PORT: 4444
    deploy:
      resources:
        limits:
          memory: 5000M
      restart_policy:
        condition: on-failure
        window: 240s
    healthcheck:
      test: ["CMD", "curl", "-I", ""]
      interval: 1m
      timeout: 60s
      retries: 3
      start_period: 300s

  chrome:
    image: selenium/node-chrome:latest
    volumes:
      - /dev/shm:/dev/shm
    privileged: true
    environment:
      HUB_HOST: hub
      HUB_PORT: 4444
    deploy:
      resources:
        limits:
          memory: 2800M
      replicas: 10
    entrypoint: bash -c 'SE_OPTS="-host $$HOSTNAME" /opt/bin/entry_point.sh'

The problem is that when the hub's status is unhealthy, swarm almost never restarts it; I have only seen it restarted successfully a few times. As far as I understand, swarm should keep restarting it until the healthcheck succeeds (or forever), but the container just keeps running in the unhealthy state.

I tried removing restart_policy completely in case it interferes with swarm mode, but it had no effect.

In addition, it seems that all replicas of the chrome container restart whenever the hub is successfully restarted. No such relationship is specified in docker-compose.yml, so why is this happening?

What could be wrong with my setup?


When I try to inspect a container (after its status is unhealthy and no more restart attempts are made), for example with docker container inspect $container_id --format '{{json .State.Health}}' | jq . or almost any other docker command on a container, it fails with this output:

docker container inspect 1abfa546cc26 --format '{{json .State.Health}}' | jq .
runtime/cgo: pthread_create failed: Resource temporarily unavailable
SIGABRT: abort
PC=0x7fa114765fff m=0 sigcode=18446744073709551610

goroutine 0 [idle]:
runtime: unknown pc 0x7fa114765fff
stack: frame={sp:0x7ffe5e0f1a08, fp:0x0} stack=[0x7ffe5d8f2fc8,0x7ffe5e0f1ff0)
00007ffe5e0f1908:  73752f3a6e696273  732f3a6e69622f72 
00007ffe5e0f1918:  6e69622f3a6e6962  2a3a36333b30303d 
00007ffe5e0f1928:  3b30303d616b6d2e  33706d2e2a3a3633 
00007ffe5e0f1938:  2a3a36333b30303d  3b30303d63706d2e 
00007ffe5e0f1948:  67676f2e2a3a3633  2a3a36333b30303d 
00007ffe5e0f1958:  333b30303d61722e  3d7661772e2a3a36 
00007ffe5e0f1968:  2e2a3a36333b3030  333b30303d61676f 
00007ffe5e0f1978:  7375706f2e2a3a36  2a3a36333b30303d 
00007ffe5e0f1988:  3b30303d7870732e  0000000000000000 
00007ffe5e0f1998:  3a36333b30303d66  2a3a36333b30303d 
00007ffe5e0f19a8:  3b30303d616b6d2e  33706d2e2a3a3633 
00007ffe5e0f19b8:  2a3a36333b30303d  3b30303d63706d2e 
00007ffe5e0f19c8:  67676f2e2a3a3633  2a3a36333b30303d 
00007ffe5e0f19d8:  333b30303d61722e  3d7661772e2a3a36 
00007ffe5e0f19e8:  2e2a3a36333b3030  333b30303d61676f 
00007ffe5e0f19f8:  7375706f2e2a3a36  0000000000000002 
00007ffe5e0f1a08: <8000000000000006  fffffffe7fffffff 
00007ffe5e0f1a18:  ffffffffffffffff  ffffffffffffffff 
00007ffe5e0f1a28:  ffffffffffffffff  ffffffffffffffff 
00007ffe5e0f1a38:  ffffffffffffffff  ffffffffffffffff 
00007ffe5e0f1a48:  ffffffffffffffff  ffffffffffffffff 
00007ffe5e0f1a58:  ffffffffffffffff  ffffffffffffffff 
00007ffe5e0f1a68:  ffffffffffffffff  ffffffffffffffff 
00007ffe5e0f1a78:  ffffffffffffffff  ffffffffffffffff 
00007ffe5e0f1a88:  ffffffffffffffff  00007fa114acd6e0 
00007ffe5e0f1a98:  00007fa11476742a  0000000000000020 
00007ffe5e0f1aa8:  0000000000000000  0000000000000000 
00007ffe5e0f1ab8:  0000000000000000  0000000000000000 
00007ffe5e0f1ac8:  0000000000000000  0000000000000000 
00007ffe5e0f1ad8:  0000000000000000  0000000000000000 
00007ffe5e0f1ae8:  0000000000000000  0000000000000000 
00007ffe5e0f1af8:  0000000000000000  0000000000000000 
runtime: unknown pc 0x7fa114765fff
stack: frame={sp:0x7ffe5e0f1a08, fp:0x0} stack=[0x7ffe5d8f2fc8,0x7ffe5e0f1ff0)
00007ffe5e0f1908:  73752f3a6e696273  732f3a6e69622f72 
00007ffe5e0f1918:  6e69622f3a6e6962  2a3a36333b30303d 
00007ffe5e0f1928:  3b30303d616b6d2e  33706d2e2a3a3633 
00007ffe5e0f1938:  2a3a36333b30303d  3b30303d63706d2e 
00007ffe5e0f1948:  67676f2e2a3a3633  2a3a36333b30303d 
00007ffe5e0f1958:  333b30303d61722e  3d7661772e2a3a36 
00007ffe5e0f1968:  2e2a3a36333b3030  333b30303d61676f 
00007ffe5e0f1978:  7375706f2e2a3a36  2a3a36333b30303d 
00007ffe5e0f1988:  3b30303d7870732e  0000000000000000 
00007ffe5e0f1998:  3a36333b30303d66  2a3a36333b30303d 
00007ffe5e0f19a8:  3b30303d616b6d2e  33706d2e2a3a3633 
00007ffe5e0f19b8:  2a3a36333b30303d  3b30303d63706d2e 
00007ffe5e0f19c8:  67676f2e2a3a3633  2a3a36333b30303d 
00007ffe5e0f19d8:  333b30303d61722e  3d7661772e2a3a36 
00007ffe5e0f19e8:  2e2a3a36333b3030  333b30303d61676f 
00007ffe5e0f19f8:  7375706f2e2a3a36  0000000000000002 
00007ffe5e0f1a08: <8000000000000006  fffffffe7fffffff 
00007ffe5e0f1a18:  ffffffffffffffff  ffffffffffffffff 
00007ffe5e0f1a28:  ffffffffffffffff  ffffffffffffffff 
00007ffe5e0f1a38:  ffffffffffffffff  ffffffffffffffff 
00007ffe5e0f1a48:  ffffffffffffffff  ffffffffffffffff 
00007ffe5e0f1a58:  ffffffffffffffff  ffffffffffffffff 
00007ffe5e0f1a68:  ffffffffffffffff  ffffffffffffffff 
00007ffe5e0f1a78:  ffffffffffffffff  ffffffffffffffff 
00007ffe5e0f1a88:  ffffffffffffffff  00007fa114acd6e0 
00007ffe5e0f1a98:  00007fa11476742a  0000000000000020 
00007ffe5e0f1aa8:  0000000000000000  0000000000000000 
00007ffe5e0f1ab8:  0000000000000000  0000000000000000 
00007ffe5e0f1ac8:  0000000000000000  0000000000000000 
00007ffe5e0f1ad8:  0000000000000000  0000000000000000 
00007ffe5e0f1ae8:  0000000000000000  0000000000000000 
00007ffe5e0f1af8:  0000000000000000  0000000000000000 

goroutine 1 [running, locked to thread]:
    /usr/local/go/src/runtime/asm_amd64.s:311 fp=0xc00009c720 sp=0xc00009c718 pc=0x565171ddf910
runtime.newproc(0x565100000000, 0x56517409ab70)
    /usr/local/go/src/runtime/proc.go:3243 +0x71 fp=0xc00009c768 sp=0xc00009c720 pc=0x565171dbdea1
    /usr/local/go/src/runtime/proc.go:239 +0x37 fp=0xc00009c788 sp=0xc00009c768 pc=0x565171db6447
    <autogenerated>:1 +0x6a fp=0xc00009c798 sp=0xc00009c788 pc=0x565171ddf5ba
    /usr/local/go/src/runtime/proc.go:147 +0xc2 fp=0xc00009c7e0 sp=0xc00009c798 pc=0x565171db6132
    /usr/local/go/src/runtime/asm_amd64.s:1337 +0x1 fp=0xc00009c7e8 sp=0xc00009c7e0 pc=0x565171de1a11

rax    0x0
rbx    0x6
rcx    0x7fa114765fff
rdx    0x0
rdi    0x2
rsi    0x7ffe5e0f1990
rbp    0x5651736b13d5
rsp    0x7ffe5e0f1a08
r8     0x0
r9     0x7ffe5e0f1990
r10    0x8
r11    0x246
r12    0x565175ae21a0
r13    0x11
r14    0x565173654be8
r15    0x0
rip    0x7fa114765fff
rflags 0x246
cs     0x33
fs     0x0
gs     0x0

To resolve it, I tried applying this solution: https://success.docker.com/article/how-to-reserve-resource-temporarily-unavailable-errors-due-to-tasksmax-setting

However, it didn't change anything, so I guess the cause is different.

journalctl -u docker is full of this log message:

 level=warning msg="Health check for container c427cfd49214d394cee8dd2c9019f6f319bc6637cfb53f0c14de70e1147b5fa6 error: context deadline exceeded"
user1935987 asked May 19 '20 12:05


1 Answer

First, I'd leave the restart_policy out. Swarm mode will recover a failed container for you, and this policy is handled outside of swarm mode and could result in unexpected behavior. Next, since you have configured the healthcheck with multiple retries, a timeout, and a start period, the way to debug it is to inspect the container. For example, you can run the following:

docker container inspect $container_id --format '{{json .State.Health}}' | jq .

The output from that will show the current status of the container, including a log of any healthcheck results over time. If that shows the container has failed more than 3 retries and is marked unhealthy, then check the service state:

docker service inspect $service_name --format '{{json .UpdateStatus}}' | jq .

That should show whether an update is currently in progress and whether the rollout of a change has run into any issues.
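As a rough sketch of the same checks wrapped in helper functions (the function names are made up for illustration; the docker subcommands and the health=unhealthy filter are standard CLI features), something like this could be sourced into a shell on a swarm node:

```shell
# Hypothetical helper: list the ID and name of every container that
# docker currently reports as unhealthy on this node.
list_unhealthy() {
  docker ps --filter "health=unhealthy" --format '{{.ID}} {{.Names}}'
}

# Hypothetical helper: show the task history of a service, including the
# full error text of tasks that failed or were rejected by the scheduler.
service_task_history() {
  local service_name="$1"
  docker service ps --no-trunc "$service_name"
}
```

Calling service_task_history with the hub's service name should show whether swarm ever tried to replace the unhealthy task, and with what error if the replacement could not be scheduled.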

One other thing to look at is the memory limit. Without a corresponding memory reservation, the scheduler may be using the limit as a reservation (I'd need to test this), and if you don't have 10G of memory available that hasn't been reserved by other containers, the scheduler may fail to reschedule the service. The easy solution is to specify a smaller reservation: the amount you want to ensure is always available on the node when the containers are scheduled. E.g.

    deploy:
      resources:
        limits:
          memory: 5000M
        reservations:
          memory: 1000M
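To get a feel for the numbers involved, here is a small arithmetic sketch. It assumes (untested, as noted above) that limits are counted as reservations, and it takes the figures from the compose file in the question; the function name is hypothetical.

```shell
# Hypothetical helper: total memory (in MB) the stack would claim across the
# swarm if every limit were treated as a reservation.
# Arguments: hub limit, per-node limit, number of node replicas.
stack_memory_mb() {
  local hub_mb="$1" node_mb="$2" replicas="$3"
  echo $(( hub_mb + node_mb * replicas ))
}

# Example with the values from the question:
#   stack_memory_mb 5000 2800 10
```

Each task is scheduled individually, so what matters on any single node is whether it still has enough unreserved memory for the tasks placed on it; the total only gives an upper bound for the whole stack.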

Based on the latest debugging output:

docker container inspect 1abfa546cc26 --format '{{json .State.Health}}' | jq .
runtime/cgo: pthread_create failed: Resource temporarily unavailable
SIGABRT: abort
PC=0x7fa114765fff m=0 sigcode=18446744073709551610

This suggests that the host itself, or the docker engine, is causing the issues, not your container's configuration. If you haven't already, I'd make sure you are running the most recent stable release of docker; at last check, that's 19.03.9. I'd check other OS logs in /var/log/ for any other errors on the host. I'd also check whether resource limits are being reached: things like memory, and any process/thread-related sysctl settings (e.g. kernel.pid_max). With docker I also recommend keeping your kernel and systemd versions updated, and rebooting after an update to those so the changes apply.
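A minimal sketch of how those process/thread limits could be checked on a Linux host follows. The function names are hypothetical, and it assumes the usual /proc layout; it only reads values and changes nothing.

```shell
# Hypothetical helper: print the kernel-wide ceilings on PIDs and threads.
show_kernel_limits() {
  echo "kernel.pid_max=$(cat /proc/sys/kernel/pid_max)"
  echo "kernel.threads-max=$(cat /proc/sys/kernel/threads-max)"
}

# Hypothetical helper: count every thread on the host by listing the
# per-thread directories under /proc/<pid>/task/.
count_threads() {
  find /proc -maxdepth 3 -path '/proc/[0-9]*/task/[0-9]*' 2>/dev/null | wc -l
}

# Hypothetical helper: integer percentage of a limit that is already used.
percent_used() {
  local used="$1" max="$2"
  echo $(( used * 100 / max ))
}

# Example, run on the affected host:
#   show_kernel_limits
#   percent_used "$(count_threads)" "$(cat /proc/sys/kernel/threads-max)"
#   systemctl show docker --property=TasksMax
```

If the thread count is close to one of these ceilings, or to the TasksMax value systemd applies to the docker unit, that would be consistent with the pthread_create failure shown above.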

I'd also recommend reviewing this unix.se post on the same error, which has a few other things to try.

If none of those help, you can contribute details to reproduce your scenario to similar open issues at:

  • https://github.com/docker/for-linux/issues/343
  • https://github.com/golang/go/issues/24484
BMitch answered Sep 28 '22 15:09
