 

Kubernetes Job failed with no logs, no termination reason, no events

Tags:

kubernetes

I ran a Job in Kubernetes overnight. When I checked it in the morning, it had failed. Normally, I'd check the pod logs or the events to determine why. However, the pod was deleted and there are no events.
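
The commands I'd normally use look something like this:

kubectl logs job/topics-etl --namespace dnc
kubectl get events --namespace dnc --sort-by=.metadata.creationTimestamp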

kubectl describe job topics-etl --namespace dnc

Here is the describe output:

Name:           topics-etl
Namespace:      dnc
Selector:       controller-uid=391cb7e5-b5a0-11e9-a905-0697dd320292
Labels:         controller-uid=391cb7e5-b5a0-11e9-a905-0697dd320292
                job-name=topics-etl
Annotations:    kubectl.kubernetes.io/last-applied-configuration:
                  {"apiVersion":"batch/v1","kind":"Job","metadata":{"annotations":{},"name":"topics-etl","namespace":"dnc"},"spec":{"template":{"spec":{"con...
Parallelism:    1
Completions:    1
Start Time:     Fri, 02 Aug 2019 22:38:56 -0500
Pods Statuses:  0 Running / 0 Succeeded / 1 Failed
Pod Template:
  Labels:  controller-uid=391cb7e5-b5a0-11e9-a905-0697dd320292
           job-name=topics-etl
  Containers:
   docsund-etl:
    Image:      acarl005/docsund-topics-api:0.1.4
    Port:       <none>
    Host Port:  <none>
    Command:
      ./create-topic-data
    Requests:
      cpu:     1
      memory:  1Gi
    Environment:
      AWS_ACCESS_KEY_ID:      <set to the key 'access_key_id' in secret 'aws-secrets'>      Optional: false
      AWS_SECRET_ACCESS_KEY:  <set to the key 'secret_access_key' in secret 'aws-secrets'>  Optional: false
      AWS_S3_CSV_PATH:        <set to the key 's3_csv_path' in secret 'aws-secrets'>        Optional: false
    Mounts:
      /app/state from topics-volume (rw)
  Volumes:
   topics-volume:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  topics-volume-claim
    ReadOnly:   false
Events:         <none>

Here is the Job config YAML. It has restartPolicy: OnFailure, but it never restarted. I also have no TTL set, so the pods should never get cleaned up.

apiVersion: batch/v1
kind: Job
metadata:
  name: topics-etl
spec:
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: docsund-etl
          image: acarl005/docsund-topics-api:0.1.6
          command: ["./create-topic-data"]
          env:
            - name: AWS_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: aws-secrets
                  key: access_key_id
            - name: AWS_SECRET_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: aws-secrets
                  key: secret_access_key
            - name: AWS_S3_CSV_PATH
              valueFrom:
                secretKeyRef:
                  name: aws-secrets
                  key: s3_csv_path
          resources:
            requests:
              cpu: 1
              memory: 1Gi
          volumeMounts:
            - name: topics-volume
              mountPath: /app/state
      volumes:
        - name: topics-volume
          persistentVolumeClaim:
            claimName: topics-volume-claim

How can I debug this?

Andy Carlson asked Aug 03 '19


1 Answer

The TTL would clean up the Job itself and all of its child objects. ttlSecondsAfterFinished is unset here, so the Job hasn't been cleaned up.
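
For reference, setting it would look something like this (the one-hour value is only illustrative, not from the posted spec):

apiVersion: batch/v1
kind: Job
metadata:
  name: topics-etl
spec:
  ttlSecondsAfterFinished: 3600   # illustrative value: delete the Job and its pods one hour after they finish
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: docsund-etl
          image: acarl005/docsund-topics-api:0.1.6
          command: ["./create-topic-data"]
          # env, resources and volumeMounts as in the original spec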

From the Job documentation:

Note: If your job has restartPolicy = "OnFailure", keep in mind that your container running the Job will be terminated once the job backoff limit has been reached. This can make debugging the Job’s executable more difficult. We suggest setting restartPolicy = "Never" when debugging the Job or using a logging system to ensure output from failed Jobs is not lost inadvertently.

The Job spec you posted doesn't set a backoffLimit, so it defaults to 6 and the Job will retry the underlying task up to 6 times before giving up.
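
Putting the two points above together, a debugging-friendly version of the spec could look something like this (a sketch; the values are only illustrative):

apiVersion: batch/v1
kind: Job
metadata:
  name: topics-etl
spec:
  backoffLimit: 6           # the default, made explicit; lower it while debugging if you like
  template:
    spec:
      restartPolicy: Never  # failed pods are kept around, so their logs survive the failure
      containers:
        - name: docsund-etl
          image: acarl005/docsund-topics-api:0.1.6
          command: ["./create-topic-data"]
          # env, resources and volumeMounts as in the original spec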

If the container process exits with a non-zero status then that run fails, and it can do so while writing nothing to the logs.
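
While a pod from the Job still exists (it no longer does here), the container's exit code and reason are visible in its status, for example:

kubectl get pods --namespace dnc --selector=job-name=topics-etl
kubectl describe pod <pod-name> --namespace dnc    # check "State" / "Last State: Terminated" for the exit code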

The spec doesn't define activeDeadlineSeconds either, so there is no Job-level deadline in play. I assume this was a hard failure inside the container, so a timeout doesn't come into it.

Matt answered Nov 04 '22