Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Updating custom resources causes them to be deleted?

When using CloudFormation templates, I find the "Custom Resource" feature, with its Lambda backing function implementation, very useful to handle all kinds of tasks that CloudFormation does not provide good support for.

Usually, I use custom resources to setup things during stack creation (such as looking up AMI names) or clean up things during deletion (such as removing objects from S3 or Route53 that would block deletion) - and this works great.

But when I try to actually use a "custom resource" to manage an actual custom resource, that has to be created during stack creation, deleted during stack deletion, and - this is where the problem lies - sometimes updated with new values during a stack update, the CloudFormation integration behaves unexpectedly and causes the custom resource to fail.

The problem seems to be that during a stack update where one of the custom resource properties has changed, during the stack's UPDATE_IN_PROGRESS stage, CloudFormation sends an update event to the backing Lambda function, with all values set correctly and a copy of the old values sent as well. But after the update completes, CloudFormation starts the UPDATE_COMPLETE_CLEANUP_IN_PROGRESS stage and sends the backing Lambda function a delete event (RequestType set to Delete).

When that happens, the backing lambda function assumes the stack is being deleted and removes the custom resource. The result is that after an update the custom resource is gone.

I've looked at the request data in the logs, and the "cleanup delete" looks identical to a real "delete" event:

Cleanup Delete:

{
RequestType: 'Delete',
ServiceToken: 'arn:aws:lambda:us-east-2:1234567890:function:stackname-resname-J0LWT56QSPIA',
ResponseURL: 'https://cloudformation-custom-resource-response-useast2.s3.us-east-2.amazonaws.com/arn%3Aaws%3Acloudformation%3Aus-east-2%3A1234567890%3Astack/stackname/3cc80cf0-5415-11e8-b6dc-503f3157b0d1%7Cresnmae%7C15521ba8-1a3c-4594-9ea9-18513efb6e8d?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20180511T140259Z&X-Amz-SignedHeaders=host&X-Amz-Expires=7199&X-Amz-Credential=AKISOMEAWSKEYID%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Signature=3abc68e1f8df46a711a2f6084debaf2a16bd0acf7f58837b9d02c805975df91b',
StackId: 'arn:aws:cloudformation:us-east-2:1234567890:stack/stackname/3cc80cf0-5415-11e8-b6dc-503f3157b0d1',
RequestId: '15521ba8-1a3c-4594-9ea9-18513efb6e8d',
LogicalResourceId: 'resname',
PhysicalResourceId: '2018/05/11/[$LATEST]28bad2681fb84c0bbf80990e1decbd97',
ResourceType: 'Custom::Resource',
ResourceProperties: {
    ServiceToken: 'arn:aws:lambda:us-east-2:1234567890:function:stackname-resname-J0LWT56QSPIA',
    VpcId: 'vpc-35512e5d',
    SomeValue: '4'
} 
}

Real Delete:

{
RequestType: 'Delete',
ServiceToken: 'arn:aws:lambda:us-east-2:1234567890:function:stackname-resname-J0LWT56QSPIA',
ResponseURL: 'https://cloudformation-custom-resource-response-useast2.s3.us-east-2.amazonaws.com/arn%3Aaws%3Acloudformation%3Aus-east-2%3A1234567890%3Astack/stackname/3cc80cf0-5415-11e8-b6dc-503f3157b0d1%7Cresname%7C6166ff92-009d-47ac-ac2f-c5be2c1a7ab2?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20180524T154453Z&X-Amz-SignedHeaders=host&X-Amz-Expires=7200&X-Amz-Credential=AKISOMEAWSKEYID%2F20180524%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Signature=29ca1d0dbdbe9246f7f82c1782726653b2aac8cd997714479ab5a080bab03cac',
StackId: 'arn:aws:cloudformation:us-east-2:123456780:stack/stackname/3cc80cf0-5415-11e8-b6dc-503f3157b0d1',
RequestId: '6166ff92-009d-47ac-ac2f-c5be2c1a7ab2',
LogicalResourceId: 'resname',
PhysicalResourceId: '2018/05/11/[$LATEST]c9494122976b4ef3a4102628fafbd1ec',
ResourceType: 'Custom::Resource',
ResourceProperties: {
    ServiceToken: 'arn:aws:lambda:us-east-2:1234567890:function:stackname-resname-J0LWT56QSPIA',
    VpcId: 'vpc-35512e5d',
    SomeValue: '0'
}
}

The only interesting request field that I can see is the physical resource ID is different, but I don't know what to correlate that to, to detect if it is the real delete or not.

like image 591
Guss Avatar asked May 30 '18 08:05

Guss


2 Answers

The problem seems to be the sample implementation of the sendResponse() function that is used to send the custom resource completion event back to CloudFormation. This method is responsible for setting the custom resource's physical resource ID. As far as I understand, this value represents the globally unique identifier of the "external resource" that is managed by the Lambda function backing the CloudFormation custom resource.

As can be seen in the CloudFormation's "Lambda-backed Custom Resource" sample code, as well as in the cfn-response NPM module's send() and the CloudFormation's built-in cfn-response module, this method has a default behavior for calculating the physical resource ID, if not provided as a 5th parameter, and it uses the CloudWatch Logs' log stream that is handling logging for the request being processed:

var responseBody = JSON.stringify({
    ...
    PhysicalResourceId: context.logStreamName,
    ...
})

Because CloudFormation (or the AWS Lambda runtime?) occasionally changes the log stream to a new one, the physical resource ID generated by sendResponse() is changing unexpectedly from time to time, and confuses CloudFormation.

As I understand it, CloudFormation managed entities sometimes need to be replaced during an update (a good example is RDS::DBInstance that needs replacing for almost any change). CloudFormation policy is that if a resource needs replacing, the new resource is created during the "update stage" and the old resource is deleted during the "cleanup stage".

So using the default sendResponse() physical resource ID calculation, the process looks like this:

  1. A stack is created.
  2. A new log stream is created to handle the custom resource logging.
  3. The backing Lambda function is called to create the resource and the default behavior set its resource ID to be the log stream ID.
  4. Some time passes
  5. The stack gets updated with new parameters for the custom resource.
  6. A new log stream is created to handle the custom resource logging, with a new ID.
  7. The backing Lambda function is called to update the resource and the default behavior set a new resource ID to the new log stream ID.
  8. CloudFormation understands that a new resource was created to replace the old resource and according to the policy it should delete the old resource during the "cleanup stage".
  9. CloudFormation reaches the "cleanup stage" and sends a delete request with the old physical resource ID.

The solution, at least in my case where I never "replace the external resource" is to fabricate a unique identifier for the managed resource, provide it as the 5th parameter to the send response routine, and then stick to it - keep sending the same physical resource ID received in the update request, in the update response. CloudFormation will then never send a delete request during the "cleanup stage".

My implemenation (in JavaScript) looks something like this:

    var resID = event.PhysicalResourceId || uuid();
    ...
    sendResponse(event, context, status, resData, resID);

Another alternative - which would probably only make sense if you actually need to replace the external resource and want to adhere to the CloudFormation model of removing the old resource during cleanup - is to use the actual external resource ID as the physical resource ID, and when receiving a delete request - to use the provided physical resource ID to delete the old external resource. That is what CloudFormation designers probably had in mind in the first place, but their default sample implementation causes a lot of confusion - probably because the sample implementation doesn't manage a real resource and has no update functionality. There is also zero documentation in CloudFormation to explain the design and reasoning.

like image 192
Guss Avatar answered Nov 03 '22 06:11

Guss


It’s important to understand the custom resource life cycle, to prevent your data from being deleted.

A very interesting and important thing to know is that CloudFormation compares the physical resource id you returned by your Lambda function to the one you returned previously. If the IDs are different, CloudFormation assumes the resource has been replaced with a new resource. Then something interesting happens.

When the resource update logic completes successfully, a Delete request is sent with the old physical resource id. If the stack update fails and a rollback occurs, the new physical resource id is sent in the Delete event.

You can read more here about custom resource life cycle and other best practices

like image 24
Bar Schwartz Avatar answered Nov 03 '22 07:11

Bar Schwartz