Writing a 'Chaos Monkey' to increase resilience

Apologies for the rather open nature of the question, but I think it's a very valuable area of discussion.

Following the recent AWS outage and the huge number of horror stories that followed it, I was really impressed by the Chaos Monkey 'technique' applied by Netflix (one of the few to survive pretty much without a scratch).

For those who don't know the concept, it is essentially a little bot that goes around your infrastructure, causing chaos along the way, as a way of continuously testing resilience.
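To make the idea concrete, here is a minimal sketch of what such a bot might look like. It is not Netflix's implementation: the target names, the `terminate` function and the dry-run default are all assumptions made for illustration, and the real disruptive action is deliberately left unimplemented.

```python
import random

# Hypothetical inventory; in practice this would come from your own
# infrastructure tooling or cloud provider's API.
TARGETS = ["web-1", "web-2", "cache-1", "db-replica-1"]


def pick_victim(targets, rng=random):
    """Choose one target uniformly at random."""
    return rng.choice(targets)


def terminate(target, dry_run=True):
    """Placeholder for the disruptive action; only reports in dry-run mode."""
    if dry_run:
        return f"DRY RUN: would terminate {target}"
    # Assumption: you would call your own tooling here.
    raise NotImplementedError("wire this up to your own infrastructure tooling")


def run_once(targets, rng=random, dry_run=True):
    """One round of chaos: pick a victim and disrupt it."""
    victim = pick_victim(targets, rng)
    return terminate(victim, dry_run=dry_run)


if __name__ == "__main__":
    print(run_once(TARGETS))
```

Run on a schedule, something like this would hit a different part of the system each time, which is the whole point: you find out what breaks while someone is around to fix it.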

Besides Jeff Atwood's Chaos Monkey post, I've been able to find little on this being employed anywhere else.

Whilst I appreciate that good test-driven development is a solid foundation, I think that this would be a great addition to the arsenal of any company/organisation that wants to stay up.

Has anyone else approached this topic before?
Are there particular areas other than connectivity and security vulnerability that you would see such a piece of code hitting?
Any other thoughts/feelings on this approach?

374

asked May 13 '11 20:05

isNaN1247

1 Answer

There are several tests you could do to stress your system. I like to use ApacheBench (ab) to load test a page that writes to the database. I test it for both the total number of hits and the number of concurrent users.

500 concurrent users making a total of 5000 requests
$ ab -n 5000 -c 500 url

I know my webserver can stand up to this, but I found a problem with how I was logging information. You could point that at different aspects of your site.

If you use caching you could clear the cache in the middle of the testing to see that everything recovers quickly.
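As a rough illustration of the cache-clearing check, the sketch below uses a hypothetical in-memory `ReadThroughCache` (not any particular caching library) and a stand-in `load_from_database` function. It wipes the cache part-way through a run of reads and counts how many reads had to fall back to the loader.

```python
class ReadThroughCache:
    """Minimal read-through cache: misses fall back to a loader function."""

    def __init__(self, loader):
        self._loader = loader
        self._store = {}
        self.misses = 0

    def get(self, key):
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._loader(key)
        return self._store[key]

    def clear(self):
        self._store.clear()


def load_from_database(key):
    # Stand-in for a slow database query (hypothetical).
    return f"value-for-{key}"


def exercise(cache, keys, clear_at):
    """Read every key, wiping the cache once when index `clear_at` is reached."""
    results = []
    for i, key in enumerate(keys):
        if i == clear_at:
            cache.clear()
        results.append(cache.get(key))
    return results


if __name__ == "__main__":
    cache = ReadThroughCache(load_from_database)
    results = exercise(cache, ["a", "b", "a", "b", "a", "b"], clear_at=4)
    print(results, "misses:", cache.misses)
```

Every read should still return the right value after the wipe; the only visible effect should be a couple of extra misses while the cache refills. Against a real site you would watch response times instead of a miss counter.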

If you can replicate your server in a VM, try changing the amount of RAM, unmounting a hard disk, running out of disk space, disconnecting a network interface, etc.
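If a VM is not to hand, some of these failures can be approximated in code by injecting the error directly. This is a different technique from actually filling a disk, so treat it as a complement rather than a replacement. The sketch below assumes a hypothetical `save_report` function and uses `unittest.mock` to make `open` fail with "no space left on device".

```python
import errno
from unittest import mock


def save_report(path, text):
    """Write a report, returning False instead of crashing if the disk is full."""
    try:
        with open(path, "w") as handle:
            handle.write(text)
        return True
    except OSError as exc:
        if exc.errno == errno.ENOSPC:
            return False
        raise


def check_survives_full_disk():
    """Simulate a full disk and confirm the caller degrades gracefully."""
    disk_full = OSError(errno.ENOSPC, "No space left on device")
    # While patched, every call to open() raises the simulated error,
    # so no file is actually created.
    with mock.patch("builtins.open", side_effect=disk_full):
        return save_report("report.txt", "hello") is False


if __name__ == "__main__":
    print("survives full disk:", check_survives_full_disk())
```

The same pattern works for other failures, for example raising `ConnectionError` from whatever function talks to the network.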

You could try to brute force a password and make sure your system only allows n login attempts before rate limiting that user.
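Here is a small sketch of that check. `LoginGuard` is a toy stand-in for a real login system, and it uses a permanent lockout as the simplest form of limiting; a real system would more likely use a time window. In practice you would point `brute_force` at your actual login endpoint instead.

```python
class LoginGuard:
    """Toy login check that locks an account after max_attempts failures."""

    def __init__(self, passwords, max_attempts=5):
        self._passwords = passwords
        self._max_attempts = max_attempts
        self._failures = {}

    def login(self, user, password):
        if self._failures.get(user, 0) >= self._max_attempts:
            return "locked"
        if self._passwords.get(user) == password:
            self._failures[user] = 0
            return "ok"
        self._failures[user] = self._failures.get(user, 0) + 1
        return "denied"


def brute_force(guard, user, guesses):
    """Try each guess in order and return the list of responses."""
    return [guard.login(user, guess) for guess in guesses]


if __name__ == "__main__":
    guard = LoginGuard({"alice": "correct horse"}, max_attempts=3)
    guesses = ["123456", "password", "qwerty", "letmein", "correct horse"]
    print(brute_force(guard, "alice", guesses))
```

The thing to verify is that once the limit is hit, even the correct password is refused, so an attacker gains nothing by continuing to guess.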

127

answered Oct 14 '22 04:10

mcotton

Tags: web-services, testing, infrastructure
