Updating an application with 100% uptime

In a past interview, I was asked how I would write a mission-critical Windows service that must maintain 100% uptime, be very responsive, and also be updatable. The service was described as a remoting-based application that takes in requests, performs calculations, and sends a response back.

My solution was to have a very generic service that simply acts as a gateway. This service would never be stopped. It would queue up requests and forward them to another service in a separate app domain, which would actually handle the request. There would need to be at least two of these handling services, so one could be brought down for an update while the other responded to incoming requests. The interfaces between the services would include the ability to handshake, to check whether a service was running, with a very small timeout so that a service that was completely down wouldn't hold up the request. I also emphasized that this solution could scale out well, since you could add more of these services on different boxes.
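
A rough sketch of that handshake-plus-timeout idea; the handler addresses, port, and 200 ms timeout are invented for illustration, not taken from any actual system:

```csharp
using System;
using System.Net.Sockets;
using System.Threading.Tasks;

class GatewaySketch
{
    // True if the handler at host:port completes a TCP handshake within
    // timeoutMs; a slow or refused connection counts as a dead handler.
    static async Task<bool> IsAlive(string host, int port, int timeoutMs)
    {
        using (var client = new TcpClient())
        {
            Task connect = client.ConnectAsync(host, port);
            Task winner = await Task.WhenAny(connect, Task.Delay(timeoutMs));
            return winner == connect && client.Connected;
        }
    }

    // Walk the handler list and pick the first one that answers in time,
    // so one box can be taken down for an update without stalling requests.
    static async Task<string> PickHandlerAsync()
    {
        var handlers = new[] { "10.0.0.1", "10.0.0.2" };  // hypothetical handler boxes
        foreach (var host in handlers)
            if (await IsAlive(host, 9000, timeoutMs: 200))
                return host;
        throw new InvalidOperationException("No handler service is available.");
    }
}
```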

The interviewer wasn't too crazy about this idea because of the latency of communicating across app domains, and even more so across the network. I argued that for a mission-critical application you have to set up a rock-solid infrastructure anyway, since software alone can't be the answer. He also said they currently have a system in place that uses reflection. I thought about loading assemblies into an app domain and watching a directory for assembly changes, but this seems far too error-prone.

Has anyone built anything with similar requirements? What solutions did you use? What doesn't work? Is reflection a usable option?

Asked Jan 02 '09 by Bob


4 Answers

.NET has built-in support for updating assemblies while they are in use. It is called shadow copying, and it effectively copies the assemblies to a separate directory before loading them, so the original files on disk stay unlocked. You still need to unload the app domain before you can load the new versions in, but other app domains can keep using the old versions of the assembly. That way one app domain can service requests while the new app domain loads. This is also how IIS and ASP.NET handle things.
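
A minimal .NET Framework sketch of that pattern; the handler directory, assembly name, and type name below are hypothetical placeholders:

```csharp
using System;

class HostSketch
{
    // Load the handler assemblies into a child AppDomain with shadow
    // copying enabled, so the DLLs on disk stay unlocked and replaceable.
    static AppDomain CreateWorkerDomain(string handlerDir)
    {
        var setup = new AppDomainSetup
        {
            ApplicationBase = handlerDir,
            ShadowCopyFiles = "true"  // runtime loads a private copy, not the original file
        };
        return AppDomain.CreateDomain("worker", null, setup);
    }

    static void Main()
    {
        AppDomain worker = CreateWorkerDomain(@"C:\service\handlers");

        // The handler type must derive from MarshalByRefObject so calls
        // cross the AppDomain boundary through a proxy.
        object handler = worker.CreateInstanceAndUnwrap("Handlers", "Handlers.RequestHandler");

        // ... route incoming requests through 'handler' ...

        // To update: drop the new DLLs into the directory, build a fresh
        // domain from them, switch traffic over, then unload the old one.
        AppDomain replacement = CreateWorkerDomain(@"C:\service\handlers");
        AppDomain.Unload(worker);
        worker = replacement;
    }
}
```

Because of the shadow copy, overwriting the DLLs never conflicts with the running domain; the swap-then-unload step at the end is what gives the gap-free handover.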

Answered by Lars Truijens


There's no such thing as 100% uptime. Even the best systems are measured in "nines": "five nines" means 99.999% uptime.

Also, a key point: this measurement applies to unscheduled downtime, as in failures. It does not include time when you bring the system down on purpose for scheduled maintenance.

In any case, the goal is to install or update software without incurring downtime, scheduled or otherwise. If dynamic reload isn't supported natively by the server, your solution seems correct, but I think that capability is built into a lot of servers these days. That is, you'd just drop your new files onto the server, and it would automatically notice that something had changed and start using them.

However, depending on the nature of the change, that might cause problems with session state: existing user sessions could end up with objects stored in session that aren't compatible with your new code. Again, the server may be smart enough to keep cached copies of the original code around until all sessions using the old code have terminated, but you may need to handle that yourself. Your "shadow server" approach should handle that nicely.

Answered by Clayton


100% uptime? Even "five nines" allows (1 − 0.99999) × 365 × 24 × 3600 ≈ 315 seconds of downtime per year, a little over five minutes. If you could manage that, you'd be doing very well indeed.

Sounds like an impossible interview question. "...maintain 100% up-time, be very responsive, and also be updatable..." - one metric for up-time was given, but none for responsiveness.

Latency IS an issue worth worrying about, but then they said it was a remoting application, so you can't get away from it. I think the interviewer might have been disagreeing for its own sake, maybe to see how you'd handle it.

Answered by duffymo


OK, a little background: I work in wireless telecom, where our platforms require absolute uptime. Having seen all the different strategies, I'd say you absolutely should not use a software-based approach; it adds software complexity where all you need to do is add some hardware.

Since they asked for a hitless upgrade, there must be a redundant system, and the absolute best way to make a server app redundant is a hardware load balancer. At work we have Foundry boxes, and all our new stuff is going onto Cisco ACE load balancers.

So what you need is two Cisco load balancers, with HSRP set up between them for failover between the load balancers. You can be very aggressive with the failover settings, but in our experience being too aggressive can cause unneeded failovers. Also, make sure to turn off proxy ARP (it'll save you heartache, since Cisco has it on by default).

Now, you have a cluster of application servers, right? So you have the load balancers ping them, port-ping them, and monitor application response times. You need at least two servers, but you can add more later (where's the capacity plan?). So here comes the hitless upgrade: during your maintenance window, from the load balancer you can admin down one of your servers, and the load balancer can do the really wicked graceful admin downs where any current connections remain until they finish up naturally.
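
On ACE-class gear those health probes are configured on the balancer itself (ICMP, TCP, or HTTP checks) rather than coded by hand, but as a rough sketch of what the server side can expose, here is a toy TCP health endpoint; the port and the "OK" reply are invented:

```csharp
using System.Net;
using System.Net.Sockets;
using System.Text;

class HealthProbeSketch
{
    static void Main()
    {
        // Hypothetical probe port the load balancer checks for liveness.
        var listener = new TcpListener(IPAddress.Any, 9001);
        listener.Start();

        while (true)
        {
            using (TcpClient probe = listener.AcceptTcpClient())
            {
                // Answer only while the application is actually healthy;
                // going silent makes the balancer pull this box from rotation.
                byte[] ok = Encoding.ASCII.GetBytes("OK\n");
                probe.GetStream().Write(ok, 0, ok.Length);
            }
        }
    }
}
```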

In this state, any requests go to the second server, and you have all the time in the world to do whatever you want to the server you're upgrading. Really, why write an app with a fancy app-domain reload scheme when you're going to have to reboot the server every three months to apply a critical Windows patch anyway? Just shell out the cash for the hardware and have something that will work properly 100% of the time, and that can get you in range of those five nines even with the unplanned problems.

Now here's the next step: geographic redundancy. Cisco does have a load-balancing product that can do geographic load balancing, but I've never seen it. The best geographic model I've seen is actually based on the requesting application. This isn't a hitless upgrade, but it is absolutely reliable. What you do is configure a primary and a failover server IP address in the requesting application. If the application sees that the primary server has become unavailable, it initiates the same request to the standby, which could be in the same server room or at the backup location. Ideal would be a combination, where the application can target a load-balancer virtual IP at one location or at the backup location, and you use the load balancers to maintain the "100%" within each location.
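
A rough sketch of that application-side failover, with invented addresses and a toy line-based protocol just to keep it self-contained:

```csharp
using System.IO;
using System.Net.Sockets;

class FailoverClientSketch
{
    const string PrimaryHost = "10.0.0.10";  // hypothetical primary (or a load-balancer VIP)
    const string StandbyHost = "10.1.0.10";  // hypothetical backup site
    const int Port = 9000;                   // hypothetical service port

    // Try the primary; if the connection fails, reissue the exact same
    // request against the standby.
    static string SendWithFailover(string request)
    {
        try
        {
            return SendRequest(PrimaryHost, request);
        }
        catch (SocketException)
        {
            return SendRequest(StandbyHost, request);
        }
    }

    // Toy one-line-request / one-line-response exchange over TCP.
    static string SendRequest(string host, string request)
    {
        using (var client = new TcpClient(host, Port))  // throws SocketException if unreachable
        using (var stream = client.GetStream())
        using (var writer = new StreamWriter(stream) { AutoFlush = true })
        using (var reader = new StreamReader(stream))
        {
            writer.WriteLine(request);
            return reader.ReadLine();
        }
    }
}
```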

Also, if he's worried about latencies between app domains, or latencies across the network, the guy's on crack, because using the proper Cisco equipment, latencies on a gig link are in the microseconds and will not be your weak point.

Good luck.

Answered by Kevin Nisbet