I'm building a server-less web-tracking system which serves its tracking pixel using AWS API Gateway, which calls a Lambda function whenever a tracking request arrives to write the tracking event into a Kinesis stream.
The Lambda function itself does not do anything fancy. It just a takes the incoming event (its own argument) and writes it to the stream. Essentially, it's just:
import boto3
kinesis_client = boto3.client("kinesis")
kinesis_stream = "my_stream_name"
def return_tracking_pixel(event, context):
...
new_record = ...(event)
kinesis_client.put_record(
StreamName=kinesis_stream,
Data=new_record,
PartitionKey=...
)
return ...
Sometimes I experience a weird spike in the Lambda execution duration that causes some of my Lambda function invocations to time-out and the tracking requests to be lost.
This is the graph of 1-minute invocation counts of the Lambda function in the in affected time period:
Between 20:50 and 23:10 I suddenly see many invocation errors (1-minute error counts):
which are obviously caused by the Lambda execution time-out (maximum duration in 1-minute intervals):
There is nothing weird going on neither with my Kinesis stream (data-in, number of put records, put_record success count etc., all looks normal), nor with my API GW (number of invocations corresponds to number of API GW calls, well within the limits of the API GW).
What could be causing the sudden (and seemingly randomly occurring) spike in the Lambda function execution duration?
EDIT: neither the lambda functions are being throttled, which was my first idea.
There are three reasons why retry and timeout issues occur when invoking a Lambda function with an AWS SDK: A remote API is unreachable or takes too long to respond to an API call. The API call doesn't get a response within the socket timeout.
Finding the root cause of the timeout. There are many reasons why a function might time out, but the most likely is that it was waiting on an IO operation to complete. Maybe it was waiting on another service (such as DynamoDB or Stripe) to respond.
why does it occur? Throttling occurs when your concurrent execution count exceeds your concurrency limit. Now, just as a reminder, if this wasn't clear, Lambda can handle multiple instance invocations at the same time and the sum of all of those invocations amounts to your concurrency execution count.
Just to add my 2 cents, because there's not much investigative work without extra logging or some X-Ray analysis.
AWS Lambda sometimes will force recycle containers which will feel like cold starts even though your function is being reasonably exercised and warmed up. This might bring all cold start related issues, like extra delays for ENIs if your Lambda has an attached VPC and so on... but even for a simple function like yours, 1 second timeout is sometimes too optimistic for a cold start.
I don't know of any documentation on those forced recycles, other than some people having evidence for it.
"We see a forced recycle about 7 times a day." source
"It also appears that even once warmed, high concurrency functions get recycled much faster than those with just a few in memory." source
I wonder how you could confirm this is the case. Perhaps you could check those errors appearing in Cloud Watch log streams to be from containers that never appeared before.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With