
AWS Lambda execution duration randomly spikes and causes time-outs

I'm building a serverless web-tracking system that serves its tracking pixel through AWS API Gateway. Whenever a tracking request arrives, API Gateway calls a Lambda function that writes the tracking event into a Kinesis stream.

The Lambda function itself does not do anything fancy. It just takes the incoming event (its own argument) and writes it to the stream. Essentially, it's just:

import boto3
kinesis_client = boto3.client("kinesis")

kinesis_stream = "my_stream_name"

def return_tracking_pixel(event, context):
    ...
    new_record = ...(event)
    kinesis_client.put_record(
        StreamName=kinesis_stream,
        Data=new_record,
        PartitionKey=...
    )
    return ...

Sometimes I see a weird spike in the Lambda execution duration that causes some of my Lambda function invocations to time out, so those tracking requests are lost.

This is the graph of 1-minute invocation counts of the Lambda function in the affected time period:

[graph: 1-minute invocation counts]

Between 20:50 and 23:10 I suddenly see many invocation errors (1-minute error counts):

[graph: 1-minute error counts]

which are obviously caused by the Lambda execution time-out (maximum duration in 1-minute intervals):

[graph: maximum duration in 1-minute intervals]

There is nothing weird going on with either my Kinesis stream (data-in, number of put records, put_record success count etc. all look normal) or my API GW (the number of invocations corresponds to the number of API GW calls, well within the limits of the API GW).

What could be causing the sudden (and seemingly randomly occurring) spike in the Lambda function execution duration?

EDIT: the Lambda functions are not being throttled either, which was my first idea.

grepe asked Jan 18 '17 10:01


1 Answer

Just to add my 2 cents, because there's not much investigative work without extra logging or some X-Ray analysis.
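As a rough illustration of the kind of extra logging meant here, the sketch below times a single call and logs how long it took. The helper name `timed_call` and the log format are made up for this example and are not part of the original function.

```python
import logging
import time

logger = logging.getLogger("tracking")
logger.setLevel(logging.INFO)


def timed_call(label, fn, *args, **kwargs):
    """Run fn(*args, **kwargs), log its wall-clock duration, return its result.

    `label` is a free-form name for the log line (hypothetical convention).
    """
    start = time.perf_counter()
    try:
        return fn(*args, **kwargs)
    finally:
        # The finally block runs even if fn raises, so slow failures are logged too.
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        logger.info("%s took %.1f ms", label, elapsed_ms)
```

Inside the handler, the `put_record` call could be wrapped as `timed_call("put_record", kinesis_client.put_record, ...)` with its existing keyword arguments. The durations then show whether the time is spent in the Kinesis call or somewhere before it.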

AWS Lambda will sometimes forcibly recycle containers, which feels like a cold start even though your function is being reasonably exercised and kept warm. This can bring all the cold-start-related issues, such as extra delays for ENIs if your Lambda is attached to a VPC, and so on. But even for a simple function like yours, a 1-second timeout is sometimes too optimistic for a cold start.

I don't know of any documentation on those forced recycles, other than some people having evidence for it.

"We see a forced recycle about 7 times a day." source

"It also appears that even once warmed, high concurrency functions get recycled much faster than those with just a few in memory." source

I wonder how you could confirm this is the case. Perhaps you could check whether the errors in the CloudWatch log streams come from containers that never appeared before.
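One possible way to do that check, sketched under assumptions: module-level code runs once per container, so a module-level marker can tell the first invocation in a container apart from later ones. The names `CONTAINER_ID`, `describe_invocation` and `log_invocation` are hypothetical. `log_stream_name` and `aws_request_id` are attributes of the Lambda Python context object, read here with `getattr` so the sketch also runs outside Lambda.

```python
import json
import uuid

# Module-level state is created once per container and survives across
# warm invocations of that same container.
CONTAINER_ID = uuid.uuid4().hex  # hypothetical marker, not an AWS identifier
_invocation_count = 0


def describe_invocation(context):
    """Return a dict saying whether this is the container's first invocation."""
    global _invocation_count
    _invocation_count += 1
    return {
        "container_id": CONTAINER_ID,
        "cold_start": _invocation_count == 1,
        "invocation_in_container": _invocation_count,
        "log_stream": getattr(context, "log_stream_name", None),
        "request_id": getattr(context, "aws_request_id", None),
    }


def log_invocation(context):
    """Print the marker as one JSON line so it ends up in the function's logs."""
    info = describe_invocation(context)
    print(json.dumps(info))
    return info
```

Calling `log_invocation(context)` at the top of the handler would let you match the request IDs of timed-out invocations against lines with `"cold_start": true`. If most time-outs line up with first invocations, that would support the recycling theory. If they do not, the cause is probably elsewhere.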

villasv answered Oct 25 '22 22:10