Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Missing log lines when writing to cloudwatch from ECS Docker containers

(Docker container on AWS-ECS exits before all the logs are printed to CloudWatch Logs) Why are some streams of a CloudWatch Logs Group incomplete (i.e., the Fargate Docker Container exits successfully but the logs stop being updated abruptly)? Seeing this intermittently, in almost all log groups, however, not on every log stream/task run. I'm running on version 1.3.0


Description:
A Dockerfile runs node.js or Python scripts using the CMD command.

These are not servers/long-running processes, and my use case requires the containers to exit when the task completes.

Sample Dockerfile:

FROM node:6
WORKDIR /path/to/app/
COPY package*.json ./
RUN npm install
COPY . .
CMD [ "node", "run-this-script.js" ]


All the logs are printed correctly to my terminal's stdout/stderr when this command is run on the terminal locally with docker run.
To run these as ECS Tasks on Fargate, the log driver for is set as awslogs from a CloudFormation Template.

...
LogConfiguration:
   LogDriver: 'awslogs'
     Options:
        awslogs-group: !Sub '/ecs/ecs-tasks-${TaskName}'
        awslogs-region: !Ref AWS::Region
        awslogs-stream-prefix: ecs
...

Seeing that sometimes the cloduwatch logs output is incomplete, I have run tests and checked every limit from CW Logs Limits and am certain the problem is not there.
I initially thought this is an issue with node js exiting asynchronously before console.log() is flushed, or that the process is exiting too soon, but the same problem occurs when i use a different language as well - which makes me believe this is not an issue with the code, but rather with cloudwatch specifically.
Inducing delays in the code by adding a sleep timer has not worked for me.

It's possible that since the docker container exits immediately after the task is completed, the logs don't get enough time to be written over to CWLogs, but there must be a way to ensure that this doesn't happen?

sample logs: incomplete stream:

{ "message": "configs to run", "data": {"dailyConfigs":"filename.json"]}}
running for filename

completed log stream:

{ "message": "configs to run", "data": {"dailyConfigs":"filename.json"]}}
running for filename

stdout: entered query_script
... <more log lines>
stderr:
real 0m23.394s
user 0m0.008s
sys 0m0.004s
(node:1) DeprecationWarning: PG.end is deprecated - please see the upgrade guide at https://node-postgres.com/guides/upgrading
like image 516
tanvi Avatar asked Feb 12 '19 18:02

tanvi


2 Answers

UPDATE: This now appears to be fixed, so there is no need to implement the workaround described below


I've seen the same behaviour when using ECS Fargate containers to run Python scripts - and had the same resulting frustration!

I think it's due to CloudWatch Logs Agent publishing log events in batches:

How are log events batched?

A batch becomes full and is published when any of the following conditions are met:

  1. The buffer_duration amount of time has passed since the first log event was added.

  2. Less than batch_size of log events have been accumulated but adding the new log event exceeds the batch_size.

  3. The number of log events has reached batch_count.

  4. Log events from the batch don't span more than 24 hours, but adding the new log event exceeds the 24 hours constraint.

(Reference: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/AgentReference.html)

So a possible explanation is that log events are buffered by the agent but not yet published when the ECS task is stopped. (And if so, that seems like an ECS issue - any AWS ECS engineers willing to give their perspective on this...?)

There doesn't seem to be a direct way to ensure the logs are published, but it does suggest one could wait at least buffer_duration seconds (by default, 5 seconds), and any prior logs should be published.

With a bit of testing that I'll describe below, here's a workaround I landed on. A shell script run_then_wait.sh wraps the command to trigger the Python script, to add a sleep after the script completes.

Dockerfile

FROM python:3.7-alpine
ADD run_then_wait.sh .
ADD main.py .

# The original command
# ENTRYPOINT ["python", "main.py"]

# To run the original command and then wait
ENTRYPOINT ["sh", "run_then_wait.sh", "python", "main.py"]

run_then_wait.sh

#!/bin/sh
set -e

# Wait 10 seconds on exit: twice the `buffer_duration` default of 5 seconds
trap 'echo "Waiting for logs to flush to CloudWatch Logs..."; sleep 10' EXIT

# Run the given command
"$@"

main.py

import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger()

if __name__ == "__main__":
    # After testing some random values, had most luck to induce the
    # issue by sleeping 9 seconds here; would occur ~30% of the time
    time.sleep(9)
    logger.info("Hello world")

Hopefully the approach can be adapted to your situation. You could also implement the sleep inside your script, but it can be trickier to ensure it happens regardless of how it terminates.

It's hard to prove that the proposed explanation is accurate, so I used the above code to test whether the workaround was effective. The test was the original command vs. with run_then_wait.sh, 30 runs each. The results were that the issue was observed 30% of the time, vs 0% of the time, respectively. Hope this is similarly effective for you!

like image 128
asavoy Avatar answered Sep 20 '22 13:09

asavoy


Just contacted AWS support about this issue and here is their response:

...

Based on that case, I can see that this occurs for containers in a Fargate Task that exit quickly after outputting to stdout/stderr. It seems to be related to how the awslogs driver works, and how Docker in Fargate communicates to the CW endpoint.

Looking at our internal tickets for the same, I can see that our service team are still working to get a permanent resolution for this reported bug. Unfortunately, there is no ETA shared for when the fix will be deployed. However, I've taken this opportunity to add this case to the internal ticket to inform the team of the similar and try to expedite the process

In the meantime, this can be avoided by extending the lifetime of the exiting container by adding a delay (~>10 seconds) between the logging output of the application and the exit of the process (exit of the container).

...

Update: Contacted AWS around August 1st, 2019, they say this issue has been fixed.

like image 24
Zhenya Avatar answered Sep 18 '22 13:09

Zhenya