
App Engine version served by "default" appears to be inconsistent and thrashes for a period after changing the default version

Our application serves an endpoint which simply reports os.environ['CURRENT_VERSION_ID']. We use this for a type of monitoring which tracks which version is currently set as the "default version".
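For readers who want to reproduce this, a minimal sketch of such an endpoint follows. It is written as a plain WSGI callable; the name `version_app` and the "unknown" fallback are assumptions for illustration, not our actual handler.

```python
import os


def version_app(environ, start_response):
    """WSGI app that replies with the version ID of the code serving this request."""
    # CURRENT_VERSION_ID is set by the App Engine runtime; the fallback value
    # is an assumption so the sketch also runs outside App Engine.
    version = os.environ.get("CURRENT_VERSION_ID", "unknown")
    body = version.encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

Because the value is read on every request, each response reflects whichever version of the code the answering instance is running.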

Starting on the afternoon of March 5th, we noticed odd behaviour when making requests to this endpoint. Shortly after we change the default version (via "appcfg.py set_default_version"), repeated requests to this endpoint flip-flop between the previous default and the new default. This persists for about 10 minutes, after which all subsequent requests report the new, correct default version. So during this 10-minute window, requests to our normal default URL inconsistently report either the old version or the new one.
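To put numbers on the flip-flopping, a small helper along these lines can summarize a series of polled responses. The function name, the `(seconds, version)` sample format, and the sample values are all hypothetical; this is a sketch of the measurement, not our monitoring code.

```python
def summarize_flips(samples, new_version):
    """Summarize polled versions.

    samples: list of (seconds, version) tuples in time order.
    Returns (number of version changes between consecutive samples,
             seconds from the first sighting of new_version to the
             last sighting of any other version).
    """
    flips = sum(1 for (_, a), (_, b) in zip(samples, samples[1:]) if a != b)
    first_new = next((t for t, v in samples if v == new_version), None)
    last_old = next((t for t, v in reversed(samples) if v != new_version), None)
    if first_new is None or last_old is None or last_old < first_new:
        window = 0  # no overlap between old and new responses was observed
    else:
        window = last_old - first_new
    return flips, window


# Hypothetical samples, polled every 10 seconds after switching to "new".
polled = [(0, "old"), (10, "new"), (20, "old"), (30, "new"), (40, "new")]
print(summarize_flips(polled, "new"))
```

A clean cutover would show a single flip and a window of 0; what we observe is many flips spread over roughly 10 minutes.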

This appears to be a change in behaviour. The previous change in default version for our application happened on March 1st, and no version change up to that date exhibited this flip-flopping behaviour.

(Question stolen from my teammate's bug report)

kamens asked Mar 14 '13 18:03


1 Answer

First a bit of background:

  • App Engine runs your application on distributed infrastructure: the more traffic your app receives, the more instances (appservers) run your code at any given time
  • For scalability/simplicity and many other reasons, App Engine does not implement client <-> appserver stickiness; as a result any request to the default app version may be handled by any appserver

After changing the default version of your application, either by changing what version is marked as the default via the admin console, or by deploying the same major version as is currently the default, information about this change is propagated through the App Engine infrastructure. As appservers become aware of the new version, they begin loading the new version of your application code. Once a given appserver is ready it will begin serving the new version of your code.

There is some period of time during which some appservers will be serving the previous default version while others are already serving the new default version. It is therefore expected that any app with a non-trivial amount of traffic will see the behavior you described.
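The effect can be illustrated with a toy model. This is only a sketch under stated assumptions: the function name, the per-appserver "ready" times, and the round-robin request routing are all made up for illustration and do not describe how App Engine actually routes requests or schedules the switch.

```python
def simulate_rollout(ready_times, request_times):
    """Toy model of a default-version change.

    ready_times: for each appserver, the time at which it starts serving
        the new version (hypothetical values).
    request_times: the times at which requests arrive.
    Requests are spread round-robin across appservers, as a stand-in for
    "any request may be handled by any appserver".
    """
    observed = []
    for i, t in enumerate(request_times):
        ready_at = ready_times[i % len(ready_times)]
        observed.append("new" if t >= ready_at else "old")
    return observed


# Three appservers that become ready at t=5, t=20 and t=40.
print(simulate_rollout([5, 20, 40], [25, 25, 25]))  # → ['new', 'new', 'old']
```

Before the first appserver is ready every response is "old", after the last one is ready every response is "new", and in between the answer depends on which appserver happens to handle the request.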

We're always working on ways to reduce the amount of time these version changes take, but our foremost concern is to ensure that the transition happens smoothly. If the application has a large number of instances serving the previous version, App Engine needs to ensure that there is always sufficient capacity (combining old and new appservers) to serve all current traffic. The previous and new versions of the app may need a different number of appservers (due to performance differences between versions), which is another reason why the transition cannot safely be executed 'instantly'.

If you'd like more control over the process, you can use App Engine's Traffic Splitting feature. In a stepwise fashion, you can increase the percentage of user traffic you'd like to direct at the new version. App Engine will then provide version stickiness based on either client IP address or a cookie (for web apps). You can also use Traffic Splitting to 'canary' a new version of the application on some percentage (say 1%) of clients.
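To show why keying on the client gives stickiness, here is a rough sketch of IP-based splitting. It is not App Engine's implementation: the function name, the SHA-256 hash, and the 100-bucket scheme are assumptions chosen for illustration.

```python
import hashlib


def pick_version(client_ip, new_version_percent, old="old", new="new"):
    """Map a client IP to a stable bucket in [0, 100) and pick a version.

    The same IP always lands in the same bucket, so a client keeps seeing
    the same version while the percentage is unchanged, and a client that
    already gets the new version keeps it as the percentage is raised.
    """
    digest = hashlib.sha256(client_ip.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return new if bucket < new_version_percent else old


# Hypothetical client address; the answer is stable across repeated calls.
print(pick_version("203.0.113.7", 1) == pick_version("203.0.113.7", 1))  # → True
```

Raising `new_version_percent` step by step (for example 1, 10, 50, 100) moves more buckets to the new version without flipping any individual client back and forth.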

Fred Sauer answered Nov 15 '22 10:11