Question: What are good strategies for achieving 0 (or as close as possible to 0) downtime when using Django?
Most of the answers I have read say "use South" or "use Fabric", but those are very vague answers IMHO. I actually use both, and I am still wondering how to get as close to zero downtime as possible.
Some details:
I have a decently sized Django application that I host on EC2. I use South for schema and data migrations, and Fabric with boto for automating repetitive deployment/backup tasks, which get triggered through a set of Jenkins (continuous integration server) tasks. The database I use is a standard PostgreSQL 9.0 instance.
I have a...
staging server that gets constantly edited by our team with all the new content and gets loaded with the latest and greatest code and a...
live server that keeps changing with user accounts and user data - all recorded in PostgreSQL.
Current deployment strategy:
When deploying new code and content, EC2 snapshots of both servers (live and staging) are created. The live server is switched to an "Updating new content" page...
Downtime begins.
The live-clone server gets migrated to the same schema version as the staging server (using South). A dump of only the tables and sequences that I want preserved from live gets created (particularly, the user accounts along with their data). Once this is done, the dump gets uploaded to the staging-clone server. The tables that were preserved from live are truncated and the data gets inserted. As the data on my live server grows, this time obviously keeps increasing.
Once the load is complete, the Elastic IP of the live server gets reassigned to the staging-clone (and thus it has been promoted to be the new live). The live instance and the live-clone instance get terminated.
Downtime ends.
Yes, this works, but as the data grows, my "virtual" zero downtime gets further and further away. Of course, something that has crossed my mind is to somehow leverage replication and to start looking into PostgreSQL replication and "eventually consistent" approaches. I know there is perhaps some magic I could do with load balancers, but the issue of accounts created in the meantime makes it tricky.
What would you recommend I look at?
Update:
I have a typical single-node Django application. I was hoping for a solution that would go more in depth into Django-specific issues. For example, the idea of using Django's support for multiple databases with custom routers alongside replication has crossed my mind. There are issues related to that which I hope an answer would touch upon.
What might be interesting to look at is a technique called Canary Releasing. I saw a great presentation by Jez Humble last year at a software conference in Amsterdam; it was about low-risk releases, the slides are here.
The idea is to not switch all systems at once, but to send a small set of users to the new version. Only when all performance metrics of the new system are as expected are the other users switched over as well. I know that this technique is also used by big sites like Facebook.
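To make the "small set of users" part concrete, here is a minimal sketch of one way to pick canary users deterministically. It is not taken from the presentation; the names `canary_bucket`, `choose_backend` and `CANARY_PERCENT` are hypothetical, and in practice the decision would usually live in a load balancer or middleware rather than in a helper like this.

```python
import hashlib

# Hypothetical setting: percentage of users sent to the new version.
CANARY_PERCENT = 5


def canary_bucket(user_id):
    """Map a user id to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(str(user_id).encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def choose_backend(user_id, canary_percent=CANARY_PERCENT):
    """Return "new" for users in the canary group, "old" for everyone else.

    The same user always gets the same answer for a given percentage,
    so nobody bounces between versions from one request to the next.
    """
    return "new" if canary_bucket(user_id) < canary_percent else "old"
```

Raising `CANARY_PERCENT` step by step (for example 5, 25, 100) switches the remaining users over once the metrics look fine; setting it back to 0 sends everyone to the old version again.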
The live server should not get migrated. That server should be accessible from two staging servers, server0 and server1. Initially, server0 is live, and changes are made to server1. When you want to change software, switch which of the two is live.
As for new content, that should not be on the staging server. It should be on the live server. Add a version number column to your content tables, and modify your code base to use the correct version number of the content. Develop software to copy old versions to new rows with updated version numbers as needed. Put the current version number in your settings.py on server0 and server1, so you have a central place for software to refer to when selecting data, or create a database access app that can be updated to get the correct versions of content. Template files, of course, can live on each server and will be appropriate for that server's version.
This approach will eliminate any downtime. You will have to rewrite some of your software, but if you find a common access method, such as a database access method that you can modify, you might find it is not that much work. The up-front investment in creating a system that specifically supports instant switching of systems will be much less work in the long term, and will scale to any content size.
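As a rough illustration of the version-column idea, here is a minimal sketch. It uses an in-memory `sqlite3` database as a stand-in for PostgreSQL so that it runs on its own; the table name `content`, its columns, and the helpers `create_schema`, `copy_version` and `get_body` are all assumptions made up for this example, not part of any existing app.

```python
import sqlite3


def create_schema(conn):
    """Create a content table keyed by (slug, version)."""
    conn.execute(
        "CREATE TABLE content ("
        " slug TEXT NOT NULL,"
        " version INTEGER NOT NULL,"
        " body TEXT NOT NULL,"
        " PRIMARY KEY (slug, version))"
    )


def copy_version(conn, old_version, new_version):
    """Copy every row of old_version into new rows tagged new_version."""
    conn.execute(
        "INSERT INTO content (slug, version, body)"
        " SELECT slug, ?, body FROM content WHERE version = ?",
        (new_version, old_version),
    )


def get_body(conn, slug, version):
    """Fetch the content for one slug at the version the caller runs on."""
    row = conn.execute(
        "SELECT body FROM content WHERE slug = ? AND version = ?",
        (slug, version),
    ).fetchone()
    return row[0] if row else None


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    conn.execute("INSERT INTO content VALUES ('home', 1, 'old text')")
    copy_version(conn, 1, 2)
    conn.execute("UPDATE content SET body = 'new text' WHERE version = 2")
    # server0 would pass its settings value (1), server1 would pass 2.
    print(get_body(conn, "home", 1), "|", get_body(conn, "home", 2))
```

In a Django project the `version` argument would come from the value kept in settings.py on each server, so both servers can read the same live database while serving different content versions, and editing version 2 never touches what version 1 users see.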