Cloud Run Flask API container running shutit enters a sleep loop

The issue appeared recently: a previously healthy container now enters a sleep loop whenever a shutit session is created. The issue occurs only on Cloud Run, not locally.

Minimal reproducible code:

requirements.txt

Flask==2.0.1
gunicorn==20.1.0
shutit

Dockerfile

FROM python:3.9

# Allow statements and log messages to immediately appear in the Cloud Run logs
ENV PYTHONUNBUFFERED True

COPY requirements.txt ./
RUN pip install -r requirements.txt

# Copy local code to the container image.
ENV APP_HOME /myapp
WORKDIR $APP_HOME
COPY . ./

CMD exec gunicorn \
 --bind :$PORT \
 --worker-class "sync" \
 --workers 1 \
 --threads 1 \
 --timeout 0 \
 main:app

main.py

import os
import shutit
from flask import Flask, request

app = Flask(__name__)

# just to prove the API works
@app.route('/ping', methods=['GET'])
def ping():
    os.system('echo pong')
    return 'OK'

# issue replication
@app.route('/healthcheck', methods=['GET'])
def healthcheck():
    os.system("echo 'healthcheck'")
    # hangs inside create_session
    shell = shutit.create_session(echo=True, loglevel='debug')
    # shell.send is never reached
    shell.send('echo Hello World', echo=True)
    # never returns
    return 'OK'

if __name__ == '__main__':
    app.run(host='127.0.0.1', port=8080, debug=True)

cloudbuild.yaml

steps:
  - id: "build_container"
    name: "gcr.io/kaniko-project/executor:latest"
    args:
      - --destination=gcr.io/$PROJECT_ID/borked-service-debug:latest
      - --cache=true
      - --cache-ttl=99h
  - id: "configure infrastructure"
    name: "gcr.io/cloud-builders/gcloud"
    entrypoint: "bash"
    args:
      - "-c"
      - |
        set -euxo pipefail

        REGION="europe-west1"
        CLOUD_RUN_SERVICE="borked-service-debug"

        SA_NAME="$${CLOUD_RUN_SERVICE}@${PROJECT_ID}.iam.gserviceaccount.com"

        gcloud beta run deploy $${CLOUD_RUN_SERVICE} \
          --service-account "$${SA_NAME}" \
          --image gcr.io/${PROJECT_ID}/$${CLOUD_RUN_SERVICE}:latest \
          --allow-unauthenticated \
          --platform managed \
          --concurrency 1 \
          --max-instances 10 \
          --timeout 1000s \
          --cpu 1 \
          --memory=1Gi \
          --region "$${REGION}"

Cloud Run logs that loop:

Setting up prompt
In session: host_child, trying to send: export PS1_ORIGIN_ENV=$PS1 && PS1='OR''IGIN_ENV:rkkfQQ2y# ' && PROMPT_COMMAND='sleep .05||sleep 1'
================================================================================
Sending>>> export PS1_ORIGIN_ENV=$PS1 && PS1='OR''IGIN_ENV:rkkfQQ2y# ' && PROMPT_COMMAND='sleep .05||sleep 1'<<<, expecting>>>['\r\nORIGIN_ENV:rkkfQQ2y# ']<<<
Sending in pexpect session (68242035994000): export PS1_ORIGIN_ENV=$PS1 && PS1='OR''IGIN_ENV:rkkfQQ2y# ' && PROMPT_COMMAND='sleep .05||sleep 1'
Expecting: ['\r\nORIGIN_ENV:rkkfQQ2y# ']
export PS1_ORIGIN_ENV=$PS1 && PS1='OR''IGIN_ENV:rkkfQQ2y# ' && PROMPT_COMMAND='sleep .05||sleep 1'
root@localhost:/myapp# export PS1_ORIGIN_ENV=$PS1 && PS1='OR''IGIN_ENV:rkkfQQ2y# ' && PROMPT_COMMAND='sleep .05||sleep 1'
Stopped sleep .05
Stopped sleep 1
pexpect: buffer: b'' before: b'cm9vdEBsb2NhbGhvc3Q6L3B1YnN1YiMgIGV4cx' after: b'DQpPUklHSU5fRU5WOnJra2ZRUTJ5IyA='
Resetting default expect to: ORIGIN_ENV:rkkfQQ2y# 
In session: host_child, trying to send: stty cols 65535
================================================================================
Sending>>> stty cols 65535<<<, expecting>>>ORIGIN_ENV:rkkfQQ2y# <<<
Sending in pexpect session (68242035994000): stty cols 65535
Expecting: ORIGIN_ENV:rkkfQQ2y# 
ORIGIN_ENV:rkkfQQ2y# stty cols 65535
stty cols 65535
Stopped stty cols 65535
Stopped sleep .05
Stopped sleep 1

Workarounds tried:

  • Different regions: a few European (tier 1 and 2), Asia, US
  • Building with Docker instead of kaniko
  • Different CPU and memory allocations for the container
  • Minimum number of instances 1-5 (to ensure CPU is always allocated to the container)
  • --no-cpu-throttling, which also made no difference
  • Maximum number of instances 1-30
  • Different GCP project
  • Different Docker base images (Python 3.5-3.9 plus various SHAs, from a year old to recent)
alanmynah asked Sep 21 '21


1 Answer

I have reproduced your issue and we have discussed several possibilities. I think the problem is that your Cloud Run instance is unable to process requests and is therefore preparing to shut down (SIGTERM). I am listing some possibilities for you to look at and analyse.

  • A common reason for a Cloud Run service failing to start is that the server process inside the container is configured to listen on the localhost (127.0.0.1) address. This is the loopback network interface, which is not reachable from outside the container, so the Cloud Run health check cannot be performed and the deployment fails. To solve this, configure your application to start the HTTP server listening on all network interfaces, commonly denoted as 0.0.0.0 (see the sketch after this list).

  • While searching for the Cloud Logging errors you are getting, I came across this answer and GitHub link from the shutit library developer, which describe a technique for tracking inputs and outputs in complex container builds in shutit sessions. One useful finding from the GitHub link: you probably have to pass the session_type, e.g. shutit.create_session('bash') or shutit.create_session('docker'), which you are not specifying in main.py. That could be the reason your shutit session is failing (also covered in the sketch after this list).

  • Also, this issue could be caused by a Linux kernel feature used by the shutit library that is not currently supported properly in gVisor. I am not sure how it worked for you the first time. Most apps will work fine, or at least as well as in regular Docker, but gVisor may not provide 100% compatibility.

    Cloud Run applications run in the gVisor container sandbox (which currently supports Linux only), which executes the Linux kernel system calls made by your application in user space. gVisor does not implement all system calls (see here). From this GitHub link: "If your app has such a system call (quite rare), it will not work on Cloud Run. Such an event is logged and you can use strace to determine when the system call was made in your app."

    If you're running your code on Linux, install strace with sudo apt-get install strace, then run your application under it by prefacing your usual invocation with strace -f, where -f means to trace all child threads. For example, if you normally invoke your application with ./main, you can run it as /usr/bin/strace -f ./main. A sketch of how this might be applied to your container follows below.
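
    As a sketch only (assuming you add strace to the image yourself; strace may not run fully inside the gVisor sandbox, so this is mainly useful for comparing a local docker run of the same image against Cloud Run), the Dockerfile could be extended like this:

# Dockerfile additions (sketch, not part of the original question)
RUN apt-get update && apt-get install -y strace

# Wrap the original start command in strace -f so forked children
# (including the shells shutit spawns) are traced; strace writes to
# stderr, which Cloud Run forwards to Cloud Logging.
CMD exec strace -f gunicorn \
 --bind :$PORT \
 --worker-class "sync" \
 --workers 1 \
 --threads 1 \
 --timeout 0 \
 main:app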

    From this documentation, if you feel your issue is caused by a limitation in the container sandbox: in the Cloud Logging section of the GCP Console (not in the "Logs" tab of the Cloud Run section), look for Container Sandbox entries with DEBUG severity in the varlog/system logs, or use the log query:

resource.type="cloud_run_revision"
logName="projects/PROJECT_ID/logs/run.googleapis.com%2Fvarlog%2Fsystem"

For example: Container Sandbox: Unsupported syscall
setsockopt(0x3,0x1,0x6,0xc0000753d0,0x4,0x0)
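
Putting the first two points together, here is a minimal sketch of what main.py could look like (the session_type value 'bash' follows the shutit GitHub link above; the 0.0.0.0 binding only matters if you start the app with python main.py rather than the gunicorn command in your Dockerfile):

import os
import shutit
from flask import Flask

app = Flask(__name__)

@app.route('/healthcheck', methods=['GET'])
def healthcheck():
    # pass an explicit session_type, as suggested in the shutit GitHub link
    shell = shutit.create_session('bash', echo=True, loglevel='debug')
    shell.send('echo Hello World', echo=True)
    return 'OK'

if __name__ == '__main__':
    # listen on all interfaces and honour the PORT env var set by Cloud Run
    app.run(host='0.0.0.0', port=int(os.environ.get('PORT', 8080)))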

By default, container instances have min-instances turned off, with a setting of 0. We can change this default using the Cloud Console, the gcloud command line, or a YAML file, by specifying a minimum number of container instances to be kept warm and ready to serve requests.
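
For example (a sketch only, reusing the service name and region from your cloudbuild.yaml):

gcloud run services update borked-service-debug \
  --min-instances 1 \
  --region europe-west1 \
  --platform managed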

You can also have a look at this documentation and GitHub link, which cover Cloud Run container runtime behaviour and troubleshooting, for reference.

Priyashree Bhadra answered Nov 07 '22