Given a large-scale software project with several components written in different languages, plus configuration files, configuration scripts, environment settings, and database migration scripts: what are the common practices for deployment to production?
What are the difficulties to consider? Can the process be simplified with tools like Ant or Maven? How can rollback and database management be handled? Is it advisable to use version control for the production environment?
As I see it, you're mostly asking about best practices and tools for release engineering AKA releng -- it's important to know the "term of art" for a subject, because it makes it much easier to search for more information.
A configuration management system (CMS -- aka revision control system or version control system) is indispensable for today's software development; if you use one or more IDEs, it's also nice to have good integration between them and the CMS, though that's more of an issue for purposes of development than for purposes of deployment / releng.
From a releng viewpoint, the key thing about a CMS is that it must have good support for "branching" (under whatever name), because releases must be made from a "release branch" where all the code under development and all of its dependencies (code and data) are in a stable "snapshot" from which the exact, identical configuration can be reproduced at will.
The need for good branching support may be more obvious if you have to maintain multiple branches (customized for different uses, platforms, whatever), but even if your releases are always, strictly in a single linear sequence, releng best practices still dictate making a release branch. "Good branching support" includes ease of merging (and "conflict resolution" when different changes are made to a file), "cherry-picking" (taking one patch or changeset from one branch, or the head/trunk, and applying it to another branch), and the like.
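For concreteness, here's roughly what cutting a release branch and cherry-picking a fix onto it might look like with git as the CMS (the branch name and commit hash are hypothetical; any CMS with good branching support offers equivalents):

```python
import subprocess

def git(*args):
    """Run a git command, failing loudly if it errors."""
    subprocess.run(["git", *args], check=True)

# Cut a release branch from the current head/trunk ("main" here).
git("checkout", "-b", "release-1.4", "main")

# Later: a fix that landed on main as commit abc1234 (hypothetical)
# must also go out with the release, so cherry-pick it across.
git("checkout", "release-1.4")
git("cherry-pick", "abc1234")
```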
In practice, you start off the release process by making a release branch; then, you run exhaustive testing on that branch (typically MUCH more than what you run everyday in your continuous build -- including extensive regression testing, integration testing, load testing, performance verification, etc, and possibly even more costly quality assurance processes, depending). If and when the exhaustive testing and QA reveal defects in the release-candidate (including regressions, performance degradation, etc), they must be fixed; in a large team, development on head/trunk may be continuing while the QA is being done, whence the need for ease of cherry-picking / merging / etc (whether your practice is to perform the fix on head or on the release branch, it still needs to be merged to the other side;-).
Last but not least, you're NOT getting full releng value from your CMS unless you're somehow tracking with it "everything" that your releases depend on -- simplest would be to have copies or hard links to all the binaries for the tools you need to build your release, etc, but that may often be impractical; so at the very least track the exact release, version, bugfix &c numbers of those tools that are used (operating system, compilers, system libraries, tools that preprocess image, sound or video files into final form, etc, etc). The key is being able, at need, to exactly reproduce the environment required to rebuild the exact version that's proposed for release (otherwise you'll go mad tracking down subtle bugs that may depend on third party tools' changes as their versions change;-).
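As a minimal sketch of that kind of environment tracking, a script along these lines could capture the exact tool versions into a manifest that gets checked in alongside the release branch (the tool list here is a hypothetical placeholder; extend it with compilers, asset pipelines, system libraries, and so on):

```python
import json
import platform
import subprocess

# Hypothetical list of tools whose exact versions this release depends on.
TOOLS = {
    "python": ["python3", "--version"],
    "gcc": ["gcc", "--version"],
}

def first_line(cmd):
    """Return the first line a tool prints for --version."""
    out = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return (out.stdout or out.stderr).splitlines()[0]

manifest = {
    "os": platform.platform(),
    "tools": {name: first_line(cmd) for name, cmd in TOOLS.items()},
}

# Check build-manifest.json in on the release branch, so the exact build
# environment can be reconstructed later.
with open("build-manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)
```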
After a CMS, the second most important tool for releng is a good issue tracking system -- ideally one that's well integrated with the CMS. That's also important for the development process (and other aspects of product management), but in terms of the release process the importance of the issue tracker is the ability to easily document exactly what bugs have been fixed, what features have been added, removed, or changed, and what modifications in performance (or other user-observable characteristics) are expected in the new forthcoming release.

For this purpose, a key best practice in development is that every changeset committed to the CMS must be connected to one (or more) issues in the issue tracking system: after all, there's gotta be some purpose for the change (fix a bug, change a feature, optimize something, or some internal refactor that's supposed to be invisible to the software's user). Similarly, every tracked issue that's marked as "closed" must be connected to one (or more) changesets (unless the closing is of the "won't fix / working as intended" kind). Issues related to bugs &c in third-party components, which have been fixed by the third-party supplier, are easy to treat similarly if you do manage to keep track of all third-party components in the CMS too (see above); if you don't, at least there should be text files under CMS documenting third-party components and their evolution (again, see above), and those files need to be changed when a tracked issue on a third-party component gets closed.
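One common way to enforce that changeset-to-issue link is a commit hook. Here's a minimal sketch written as a git commit-msg hook, assuming a hypothetical "ISSUE-nnn" convention for issue IDs in your tracker:

```python
#!/usr/bin/env python3
# Sketch of a commit-msg hook (e.g. .git/hooks/commit-msg for git) that
# rejects any changeset whose message doesn't reference a tracked issue.
# The "ISSUE-123" pattern is a hypothetical tracker convention.
import re
import sys

message = open(sys.argv[1]).read()  # git passes the message file's path
if not re.search(r"\bISSUE-\d+\b", message):
    sys.stderr.write("rejected: commit must cite an issue, e.g. ISSUE-42\n")
    sys.exit(1)
```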
Automating the various releng processes (including building, automated testing, and deployment tasks) is the third top priority -- automated processes are much more productive and repeatable than asking some poor individual to manually go through a list of steps (for sufficiently complex tasks, of course, the workflow of the automation may need to "get a human being in the loop"). As you surmise, tools such as Ant (and SCons, etc, etc) can help here, but inevitably (unless you're lucky enough to get away with very simple and straightforward processes) you'll find yourself enriching them with ad-hoc scripts &c (some powerful and flexible scripting language such as perl, python, ruby, &c, will help). A "workflow engine" can also be precious when your release workflow is sufficiently complex (e.g. involving specific human beings or groups thereof "signing off" on QA compliance, legal compliance, UI guidelines compliance, and so forth).
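As a toy illustration of that enrichment, an ad-hoc glue script around the build tool might look something like this (every command is a stand-in for your actual build targets and test/deploy scripts):

```python
import subprocess
import sys

# Hypothetical release pipeline: each command is a placeholder for
# whatever your Ant/SCons targets and test/deploy scripts actually are.
STEPS = [
    ["ant", "clean", "dist"],                      # build
    ["python3", "run_tests.py"],                   # exhaustive test suite
    ["python3", "deploy.py", "--env", "staging"],  # push to staging
]

for step in STEPS:
    print("running:", " ".join(step))
    if subprocess.run(step).returncode != 0:
        sys.exit("step failed: " + " ".join(step))
```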
Some other specific issues you're asking about vary enormously depending on specifics of your environment. If you can afford programmed downtime, your life is relatively easy, even with a large database in play, as you can operate sequentially and deterministically: you shut the existing system down gracefully, ensure the current database is saved and backed up (easing rollback, in the hopefully VERY rare case it's needed), run the one-off scripts for schema migration or other "irreversible" environment changes, fire the system back up again in a mode that's still inaccessible to general users, run another extensive suite of automated tests -- and finally if everything's gone smoothly (including the saving and backup of the DB in its new state, if relevant) the system's opened up to general use again.
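Sketched as a script, that sequence might look roughly like the following; every service name, command, and path is a hypothetical placeholder for your own environment:

```python
import subprocess

def run(cmd):
    """Run one deployment step, aborting the release on any failure."""
    print("+", cmd)
    subprocess.run(cmd, shell=True, check=True)

# All service names, scripts and paths below are hypothetical placeholders.
run("systemctl stop myapp")                       # graceful shutdown
run("pg_dump mydb > backup-before-release.sql")   # save/back up current DB
run("python3 migrate.py --target-schema 42")      # one-off "irreversible" migrations
run("systemctl start myapp-restricted")           # bring up, closed to general users
run("python3 release_test_suite.py")              # extensive automated tests
run("pg_dump mydb > backup-after-release.sql")    # back up DB in its new state
run("systemctl start myapp")                      # reopen to general use
```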
If you need to update a "live" system, without downtime, this may range anywhere from a slight inconvenience to a systematic nightmare. In the best case, transactions are reasonably short, and synchronization between the state set by transactions may be delayed a bit without damage... and you have a reasonable abundance of resources (CPUs, storage, &c). In this case, you run two systems in parallel -- the old one and the new one -- and just make sure all new transactions target the new system, while letting old ones complete on the old system. A separate task periodically syncs "new data in the old system" to the new system, as transactions on the old system terminate. Eventually you can determine that no transactions are running on the old system and all changes that happened there are synced up to the new one -- and at that time you can finally shut down the old system. (You need to be prepared to "reverse sync" too of course, in case a rollback of the change is needed).
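Here's a minimal sketch of that periodic sync task, using sqlite3 as a stand-in for the real DB driver and a hypothetical `transactions` table with an `updated_at` column; a production version would also need conflict handling and the reverse-sync path for rollback:

```python
import sqlite3  # stand-in for your real DB driver

def sync_once(old_db, new_db, watermark):
    """Copy rows changed on the old system since `watermark` to the new one."""
    rows = old_db.execute(
        "SELECT id, payload, updated_at FROM transactions WHERE updated_at > ?",
        (watermark,),
    ).fetchall()
    for rid, payload, updated_at in rows:
        new_db.execute(
            "INSERT OR REPLACE INTO transactions (id, payload, updated_at) "
            "VALUES (?, ?, ?)",
            (rid, payload, updated_at),
        )
        watermark = max(watermark, updated_at)
    new_db.commit()
    return watermark

# Run it periodically until the old system has drained, e.g.:
#   while transactions_still_open(old_db):   # hypothetical helper
#       watermark = sync_once(old_db, new_db, watermark)
#       time.sleep(5)
```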
This is the "simple, sweet" extreme for live system updating; at the other extreme, you can find yourself in such an overconstrained situation that you can prove the task is impossible (you just cannot logically meet all the stated requirements with the given resources). Long sessions opened on the old system which just cannot be terminated -- scarce resources that make it impossible to run two systems in parallel -- core requirements for hard-real-time sync of every transaction -- etc, etc, can all make your life miserable (and, as I noticed, at the extreme, can make the stated task absolutely impossible). The two best things you can do about this: (1) ensure you have abundant resources (this will also save your skin when some server unexpectedly goes belly-up... you'll have another one to fire up to meet the emergency!-); (2) consider this predicament from the start, when initially defining the architecture of the overall system (e.g.: prefer short-lived transactions to long-lived sessions that just can't be "snapshot, closed down, and restarted seamlessly from the snapshot", is one good architectural pointer;-).
Disclaimer: where I work, we use something I wrote that I'm going to mention.
I'll tell you how I do it.
For configuration settings and general deployment of code and content, I use a combination of NAnt, a CI server, and dashy (an automatic deployment tool). You can replace dashy with any other 'thing' that automates the uploads to your server (perhaps Capistrano).
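For illustration, the automated-upload step such a tool performs can be as small as this (host, user, and paths are hypothetical):

```python
import subprocess

# A minimal stand-in for the automated-upload step that dashy/Capistrano
# would perform; host, user and paths are hypothetical placeholders.
subprocess.run(
    ["rsync", "-az", "--delete",
     "build/", "deploy@www.example.com:/var/www/app/"],
    check=True,
)
```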
For DB scripts, we use RedGate SQL Compare to generate the scripts, but for most changes I actually write them manually where appropriate, because some of the changes are slightly complex and I feel more comfortable doing them by hand. You can use the tool to do it for you (or at least to generate the scripts).
Depending on your language, there are some tools that can script the DB updates for you (I think someone on this forum wrote one; hopefully he'll reply), but I have no experience with those. It's something I'd like to add though.
Difficulties
I forgot to answer one of your questions.
The biggest problem in updating any significantly complicated/distributed site is DB synchronisation. You need to consider whether you will have any downtime, and if you do, what will happen to the DBs. Will you shut down everything, so that no transactions can be processed? Or will you shift everything to one server, update DB B, sync DB A & B, and then update DB A? Or something else?
Whatever you choose, you need to choose it deliberately, and either say "Okay, each update will incur X downtime", or whatever applies. Just have it documented.
The worst thing you can do is have someone's transaction fail mid-processing because you were updating that server, or somehow leave only part of your system operational.