The issue appeared recently: a previously healthy container now enters a sleep loop whenever a shutit session is created. The issue occurs only on Cloud Run, not locally.
Minimal reproducible code:
requirements.txt
Flask==2.0.1
gunicorn==20.1.0
shutit
Dockerfile
FROM python:3.9
# Allow statements and log messages to immediately appear in the Cloud Run logs
ENV PYTHONUNBUFFERED True
COPY requirements.txt ./
RUN pip install -r requirements.txt
# Copy local code to the container image.
ENV APP_HOME /myapp
WORKDIR $APP_HOME
COPY . ./
CMD exec gunicorn \
    --bind :$PORT \
    --worker-class "sync" \
    --workers 1 \
    --threads 1 \
    --timeout 0 \
    main:app
main.py
import os
import shutit
from flask import Flask, request

app = Flask(__name__)

# just to prove the API works
@app.route('/ping', methods=['GET'])
def ping():
    os.system('echo pong')
    return 'OK'

# issue replication
@app.route('/healthcheck', methods=['GET'])
def healthcheck():
    os.system("echo 'healthcheck'")
    # hangs inside create_session
    shell = shutit.create_session(echo=True, loglevel='debug')
    # shell.send is never reached
    shell.send('echo Hello World', echo=True)
    # never returns
    return 'OK'

if __name__ == '__main__':
    app.run(host='127.0.0.1', port=8080, debug=True)
cloudbuild.yaml
steps:
  - id: "build_container"
    name: "gcr.io/kaniko-project/executor:latest"
    args:
      - --destination=gcr.io/$PROJECT_ID/borked-service-debug:latest
      - --cache=true
      - --cache-ttl=99h

  - id: "configure infrastructure"
    name: "gcr.io/cloud-builders/gcloud"
    entrypoint: "bash"
    args:
      - "-c"
      - |
        set -euxo pipefail

        REGION="europe-west1"
        CLOUD_RUN_SERVICE="borked-service-debug"
        SA_NAME="$${CLOUD_RUN_SERVICE}@${PROJECT_ID}.iam.gserviceaccount.com"

        gcloud beta run deploy $${CLOUD_RUN_SERVICE} \
          --service-account "$${SA_NAME}" \
          --image gcr.io/${PROJECT_ID}/$${CLOUD_RUN_SERVICE}:latest \
          --allow-unauthenticated \
          --platform managed \
          --concurrency 1 \
          --max-instances 10 \
          --timeout 1000s \
          --cpu 1 \
          --memory=1Gi \
          --region "$${REGION}"
Cloud Run logs that loop:
Setting up prompt
In session: host_child, trying to send: export PS1_ORIGIN_ENV=$PS1 && PS1='OR''IGIN_ENV:rkkfQQ2y# ' && PROMPT_COMMAND='sleep .05||sleep 1'
================================================================================
Sending>>> export PS1_ORIGIN_ENV=$PS1 && PS1='OR''IGIN_ENV:rkkfQQ2y# ' && PROMPT_COMMAND='sleep .05||sleep 1'<<<, expecting>>>['\r\nORIGIN_ENV:rkkfQQ2y# ']<<<
Sending in pexpect session (68242035994000): export PS1_ORIGIN_ENV=$PS1 && PS1='OR''IGIN_ENV:rkkfQQ2y# ' && PROMPT_COMMAND='sleep .05||sleep 1'
Expecting: ['\r\nORIGIN_ENV:rkkfQQ2y# ']
export PS1_ORIGIN_ENV=$PS1 && PS1='OR''IGIN_ENV:rkkfQQ2y# ' && PROMPT_COMMAND='sleep .05||sleep 1'
root@localhost:/myapp# export PS1_ORIGIN_ENV=$PS1 && PS1='OR''IGIN_ENV:rkkfQQ2y# ' && PROMPT_COMMAND='sleep .05||sleep 1'
Stopped sleep .05
Stopped sleep 1
pexpect: buffer: b'' before: b'cm9vdEBsb2NhbGhvc3Q6L3B1YnN1YiMgIGV4cx' after: b'DQpPUklHSU5fRU5WOnJra2ZRUTJ5IyA='
Resetting default expect to: ORIGIN_ENV:rkkfQQ2y#
In session: host_child, trying to send: stty cols 65535
================================================================================
Sending>>> stty cols 65535<<<, expecting>>>ORIGIN_ENV:rkkfQQ2y# <<<
Sending in pexpect session (68242035994000): stty cols 65535
Expecting: ORIGIN_ENV:rkkfQQ2y#
ORIGIN_ENV:rkkfQQ2y# stty cols 65535
stty cols 65535
Stopped stty cols 65535
Stopped sleep .05
Stopped sleep 1
Workarounds tried:
--no-cpu-throttling
also made no difference.

I have reproduced your issue and we have discussed several possibilities; I think the issue is that your Cloud Run service is unable to finish processing requests and is therefore preparing to shut down (SIGTERM). I am listing some possibilities for you to look at and analyse.
A common reason for a Cloud Run service failing to start is that the server process inside the container is configured to listen on the localhost (127.0.0.1) address. This is the loopback network interface, which is not reachable from outside the container, so the Cloud Run health check cannot be performed and the deployment fails. To solve this, configure your application to start the HTTP server listening on all network interfaces, commonly denoted as 0.0.0.0.
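As a minimal sketch (relevant only if the Flask development server is ever used instead of the gunicorn CMD above; the fallback port of 8080 is my assumption), the app could bind to all interfaces and honour the PORT environment variable that Cloud Run injects:

import os
from flask import Flask

app = Flask(__name__)

if __name__ == '__main__':
    # Listen on all interfaces and on the port Cloud Run provides via $PORT.
    app.run(host='0.0.0.0', port=int(os.environ.get('PORT', 8080)))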
While searching for the Cloud Run log errors you are getting, I came across this answer and a GitHub link from the shutit library developer, which points to a technique for tracking inputs and outputs in complex container builds in shutit sessions. One useful finding from the GitHub link: I think you will have to pass the session_type to shutit.create_session('bash') or shutit.create_session('docker'), which you are not specifying in main.py. That could be the reason your shutit session is failing, as sketched below.
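A minimal sketch of the healthcheck handler with the session type passed explicitly, keeping the echo and loglevel arguments from your original code (whether 'bash' is the right session type here is an assumption):

import shutit

# Pass the session type explicitly instead of relying on the default.
shell = shutit.create_session('bash', echo=True, loglevel='debug')
shell.send('echo Hello World', echo=True)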
This issue could also be due to a Linux kernel feature used by the shutit library that is not currently supported properly in gVisor. I am not sure how it executed for you the first time. Most apps will work fine, or at least as well as in regular Docker, but gVisor may not provide 100% compatibility.
Cloud Run applications run in the gVisor container sandbox (which currently supports Linux only), which executes the Linux kernel system calls made by your application in userspace. gVisor does not implement all system calls (see here). From this GitHub link: "If your app has such a system call (quite rare), it will not work on Cloud Run. Such an event is logged and you can use strace to determine when the system call was made in your app."
If you're running your code on Linux, install and enable strace:
sudo apt-get install strace
Run your application with strace by prefacing your usual invocation with strace -f, where -f means to trace all child threads. For example, if you normally invoke your application with ./main, you can run it with strace by invoking /usr/bin/strace -f ./main.
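As a sketch of how that advice could be applied to the Dockerfile above while debugging (installing strace in the image is my assumption, not part of your current build):

# Install strace for debugging only.
RUN apt-get update && apt-get install -y strace
# Wrap the usual gunicorn invocation so all child processes are traced.
CMD exec strace -f gunicorn --bind :$PORT --workers 1 --threads 1 --timeout 0 main:app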
From this documentation, if you feel your issue is caused by a limitation in the container sandbox: in the Cloud Logging section of the GCP Console (not in the "Logs" tab of the Cloud Run section), you can look for Container Sandbox entries with DEBUG severity in the varlog/system logs, or use the Log Query:
resource.type="cloud_run_revision" logName="projects/PROJECT_ID/logs/run.googleapis.com%2Fvarlog%2Fsystem"
For example: Container Sandbox: Unsupported syscall setsockopt(0x3,0x1,0x6,0xc0000753d0,0x4,0x0)
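A sketch of running the same filter from the command line with gcloud (PROJECT_ID and the limit of 50 are placeholders):

gcloud logging read \
  'resource.type="cloud_run_revision" logName="projects/PROJECT_ID/logs/run.googleapis.com%2Fvarlog%2Fsystem" severity=DEBUG' \
  --project PROJECT_ID --limit 50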
By default, container instances have min-instances turned off (set to 0). You can change this default using the Cloud Console, the gcloud command line, or a YAML file, by specifying a minimum number of container instances to be kept warm and ready to serve requests, for example as shown below.
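For example, with gcloud (the value of 1 is purely illustrative):

gcloud run services update borked-service-debug \
  --min-instances 1 \
  --region europe-west1 \
  --platform managed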
You can also have a look at this documentation and GitHub link, which discuss Cloud Run container runtime behaviour and troubleshooting, for reference.