Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

K8S Pod with startupProbe and initialDelaySeconds specified waits too long to become Ready

I have been trying to debug a very odd delay in my K8S deployments. I have tracked it down to the simple reproduction below. What it appears is that if I set an initialDelaySeconds on a startup probe or leave it 0 and have a single failure, then the probe doesn't get run again for a while and ends up with atleast a 1-1.5 minute delay getting into Ready:true state.

I am running locally with Ubutunu 18.04 and microk8s v1.19.3 with the following versions:

  • kubelet: v1.19.3-34+a56971609ff35a
  • kube-proxy: v1.19.3-34+a56971609ff35a
  • containerd://1.3.7
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: microbot
  name: microbot
spec:
  replicas: 1
  selector:
    matchLabels:
      app: microbot
  strategy: {}
  template:
    metadata:
      labels:
        app: microbot
    spec:
      containers:
      - image: cdkbot/microbot-amd64
        name: microbot
        command: ["/bin/sh"]
        args: ["-c", "sleep 3; /start_nginx.sh"]
        #args: ["-c", "/start_nginx.sh"]
        ports:
        - containerPort: 80
        startupProbe:
          httpGet:
            path: /
            port: 80
          initialDelaySeconds: 0  # 5 also has same issue
          periodSeconds: 1
          failureThreshold: 10
          successThreshold: 1
        ##livenessProbe:
        ##  httpGet:
        ##    path: /
        ##    port: 80
        ##  initialDelaySeconds: 0
        ##  periodSeconds: 10
        ##  failureThreshold: 1
        resources: {}
      restartPolicy: Always
      serviceAccountName: ""
status: {}
---
apiVersion: v1
kind: Service
metadata:
  name: microbot
  labels:
    app: microbot
spec:
  ports:
    - port: 80
      protocol: TCP
      targetPort: 80
  selector:
    app: microbot

The issue is that if I have any delay in the startupProbe or if there is an initial failure, the pod gets into Initialized:true state but had Ready:False and ContainersReady:False. It will not change from this state for 1-1.5 minutes. I haven't found a pattern to the settings.

I left in the comment out settings as well so you can see what I am trying to get to here. What I have is a container starting up that has a service that will take a few seconds to get started. I want to tell the startupProbe to wait a little bit and then check every second to see if we are ready to go. The configuration seems to work, but there is a baked in delay that I can't track down. Even after the startup probe is passing, it does not transition the pod to Ready for more than a minute.

Is there some setting elsewhere in k8s that is delaying the amount of time before a Pod can move into Ready if it isn't Ready initially?

Any ideas are greatly appreciated.

like image 745
Allen Avatar asked Dec 02 '20 17:12

Allen


People also ask

How do you fix readiness probe failure?

Increase the Timeout of the Readiness Probe To increase the Readiness probe timeout, configure the Managed controller item and update the value of "Readiness Timeout". By default it set to 5 (5 seconds). You may increase it to for example 30 (30 seconds).

What happens if readiness probe fails?

After a liveness probe fail, the container should restart and ideally should start serving the traffic again, just like how it would happen for a k8s deployment.

What is difference between readiness and liveness probe?

In short, failing liveness probe will restart the container, while failing readiness probe will stop our application from serving traffic. Readiness probe: Readiness probes are designed to let Kubernetes know when our app is ready to serve traffic.


1 Answers

Actually I made a mistake in comments, you can use initialDelaySeconds in startupProbe, but you should rather use failureThreshold and periodSeconds instead.


As mentioned here

Kubernetes Probes

Kubernetes supports readiness and liveness probes for versions ≤ 1.15. Startup probes were added in 1.16 as an alpha feature and graduated to beta in 1.18 (WARNING: 1.16 deprecated several Kubernetes APIs. Use this migration guide to check for compatibility). All the probe have the following parameters:

  • initialDelaySeconds : number of seconds to wait before initiating liveness or readiness probes
  • periodSeconds: how often to check the probe
  • timeoutSeconds: number of seconds before marking the probe as timing out (failing the health check)
  • successThreshold : minimum number of consecutive successful checks for the probe to pass
  • failureThreshold : number of retries before marking the probe as failed. For liveness probes, this will lead to the pod restarting. For readiness probes, this will mark the pod as unready.

So why should you use failureThreshold and periodSeconds?

consider an application where it occasionally needs to download large amounts of data or do an expensive operation at the start of the process. Since initialDelaySeconds is a static number, we are forced to always take the worst-case scenario (or extend the failureThreshold that may affect long-running behavior) and wait for a long time even when that application does not need to carry out long-running initialization steps. With startup probes, we can instead configure failureThreshold and periodSeconds to model this uncertainty better. For example, setting failureThreshold to 15 and periodSeconds to 5 means the application will get 15 (fifteen) x 5 (five) = 75s to startup before it fails.

Additionally if you need more informations take a look at this article on medium.


Quoted from kubernetes documentation about Protect slow starting containers with startup probes

Sometimes, you have to deal with legacy applications that might require an additional startup time on their first initialization. In such cases, it can be tricky to set up liveness probe parameters without compromising the fast response to deadlocks that motivated such a probe. The trick is to set up a startup probe with the same command, HTTP or TCP check, with a failureThreshold * periodSeconds long enough to cover the worse case startup time.

So, the previous example would become:

ports:
- name: liveness-port
  containerPort: 8080
  hostPort: 8080

livenessProbe:
  httpGet:
    path: /healthz
    port: liveness-port
  failureThreshold: 1
  periodSeconds: 10

startupProbe:
  httpGet:
    path: /healthz
    port: liveness-port
  failureThreshold: 30
  periodSeconds: 10

Thanks to the startup probe, the application will have a maximum of 5 minutes (30 * 10 = 300s) to finish its startup. Once the startup probe has succeeded once, the liveness probe takes over to provide a fast response to container deadlocks. If the startup probe never succeeds, the container is killed after 300s and subject to the pod's restartPolicy.

like image 74
Jakub Avatar answered Sep 17 '22 00:09

Jakub