Apologies for the rather open nature of the question, but I think it's a very valuable area of discussion.
Following the recent AWS outage and the huge number of horror stories that followed it, I was really impressed by the Chaos Monkey 'technique' applied by Netflix (one of the few to survive pretty much without a scratch).
For those who don't know the concept, it is essentially a little bot that goes around your infrastructure, causing chaos along the way, as a way of continuously testing resilience.
Besides Jeff Atwood's Chaos Monkey post, I've been able to find little on this being employed anywhere else.
Whilst I appreciate that good test-driven development is a solid foundation, I think that this would be a great addition to the arsenal of any company/organisation that wants to stay up.
Chaos Monkey is a software tool that was developed by Netflix engineers to test the resiliency and recoverability of their Amazon Web Services (AWS) infrastructure. The software simulates failures of instances of services running within Auto Scaling Groups (ASG) by shutting down one or more of the virtual machines.
Chaos Monkey is a tool invented in 2011 by Netflix to test the resilience of its IT infrastructure. It works by intentionally disabling computers in Netflix's production network to test how remaining systems respond to the outage.
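To make the idea concrete, here is a minimal in-memory sketch of the behaviour described above: pick a running instance at random, kill it, and check that the group heals itself. It does not talk to AWS and is not Netflix's implementation; `Instance`, `ScalingGroup`, `reconcile`, `unleash_monkey` and `run_demo` are all hypothetical names made up for illustration.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Instance:
    instance_id: str
    running: bool = True


@dataclass
class ScalingGroup:
    """Hypothetical in-memory stand-in for an auto scaling group."""
    name: str
    desired_capacity: int
    instances: list = field(default_factory=list)
    _counter: int = 0

    def launch(self):
        # Create a new instance with a sequential, made-up identifier.
        self._counter += 1
        inst = Instance(f"{self.name}-{self._counter}")
        self.instances.append(inst)
        return inst

    def running(self):
        return [i for i in self.instances if i.running]

    def reconcile(self):
        """Launch replacements until the running count matches desired capacity."""
        while len(self.running()) < self.desired_capacity:
            self.launch()


def unleash_monkey(group, rng):
    """Terminate one randomly chosen running instance; return None if none exist."""
    candidates = group.running()
    if not candidates:
        return None
    victim = rng.choice(candidates)
    victim.running = False
    return victim


def run_demo(seed=0):
    """Kill one instance in a three-instance group, then let the group recover."""
    rng = random.Random(seed)
    group = ScalingGroup("web", desired_capacity=3)
    group.reconcile()
    victim = unleash_monkey(group, rng)
    after_kill = len(group.running())
    group.reconcile()
    return victim.instance_id, after_kill, len(group.running())
```

Calling `run_demo()` returns the id of the terminated instance, the running count right after the kill (2) and the count after recovery (3). In a real setup the interesting part is everything this toy leaves out: whether traffic keeps flowing while the replacement comes up.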
Understanding system operations: A well-engineered chaos test can reveal many valuable insights into how applications respond to emergent situations. Before performing a chaos test, engineers first measure stable conditions, and then formulate a theory about how the system will handle a particular type of stress.
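The "measure stable conditions, then formulate a theory" step can be sketched roughly as below. This is only one possible way to phrase such a check, assuming you have collected latency samples before and during the experiment. The function names, the three-standard-deviation tolerance and the sample numbers in the usage note are all assumptions for illustration, not part of any particular chaos tool.

```python
import statistics


def steady_state(samples):
    """Summarise baseline measurements (e.g. latencies in ms) as (mean, std dev)."""
    return statistics.mean(samples), statistics.pstdev(samples)


def hypothesis_holds(baseline, during_chaos, tolerance=3.0):
    """Return True if the mean observed during the experiment stays within
    `tolerance` baseline standard deviations of the baseline mean."""
    mean, stdev = steady_state(baseline)
    chaos_mean = statistics.mean(during_chaos)
    return abs(chaos_mean - mean) <= tolerance * stdev
```

With a made-up baseline of `[100, 102, 98, 101, 99]` ms, samples of `[101, 103, 102]` taken during the experiment would pass the check, while `[150, 160, 155]` would fail it, which is the signal to stop and investigate.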
There are several tests you could do to stress your system. I like to use ApacheBench (ab) to load test a page that writes to the database. I test it both for the total number of hits and for concurrent users.
For example, 500 concurrent users making a total of 5000 requests:
$ ab -n 5000 -c 500 url
I know my webserver can stand up to this, but I found a problem with how I was logging information. You could point that at different aspects of your site.
If you use caching you could clear the cache in the middle of the testing to see that everything recovers quickly.
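As a rough illustration of the cache-flush test, the sketch below wipes a small read-through cache partway through a run of requests and lets you check that answers stay correct and how many extra misses the flush costs. `ReadThroughCache` and `run_with_flush` are hypothetical names, and the in-memory dict stands in for whatever cache you actually run.

```python
class ReadThroughCache:
    """Toy read-through cache: on a miss, fetch the value via `loader` and store it."""

    def __init__(self, loader):
        self._loader = loader
        self._store = {}
        self.misses = 0

    def get(self, key):
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._loader(key)
        return self._store[key]

    def clear(self):
        self._store.clear()


def run_with_flush(cache, keys, flush_at):
    """Request each key in order, clearing the cache just before index `flush_at`."""
    results = []
    for i, key in enumerate(keys):
        if i == flush_at:
            cache.clear()
        results.append(cache.get(key))
    return results
```

Against a real site you would watch response times and backend load around the flush rather than a miss counter, but the shape of the test is the same: flush mid-run, then confirm nothing returns wrong data and the miss rate settles back down.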
If you can replicate your server in a VM, you can change the amount of RAM, unmount a hard disk, run out of disk space, disconnect a network interface, etc.
You could try to brute force a password and make sure your system only allows n login attempts before rate limiting that user.
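A self-contained sketch of that brute-force check follows. It models the limiter and the attacker in memory so the behaviour can be verified without hitting a real login form. `LoginLimiter`, `brute_force`, the sliding-window policy and the one-guess-per-second pacing are all assumptions for illustration; your system's actual policy may differ.

```python
class LoginLimiter:
    """Allow at most `max_attempts` failed logins per user within `window_seconds`."""

    def __init__(self, max_attempts, window_seconds):
        self.max_attempts = max_attempts
        self.window_seconds = window_seconds
        self._failures = {}  # username -> timestamps of recent failed attempts

    def allow(self, username, now):
        # Drop failures that have aged out of the window, then check the count.
        recent = [t for t in self._failures.get(username, [])
                  if now - t < self.window_seconds]
        self._failures[username] = recent
        return len(recent) < self.max_attempts

    def record_failure(self, username, now):
        self._failures.setdefault(username, []).append(now)


def brute_force(limiter, username, guesses, check_password, start=0.0):
    """Try guesses one simulated second apart.
    Return how many guesses reached the password check before being blocked."""
    attempted = 0
    for i, guess in enumerate(guesses):
        now = start + i
        if not limiter.allow(username, now):
            break
        attempted += 1
        if check_password(guess):
            break
        limiter.record_failure(username, now)
    return attempted
```

With a limit of 3 failures per 300 seconds, a run of ten wrong guesses should stop after exactly 3 attempts. If the same test against your real login endpoint lets the fourth guess through, the rate limiting is not doing its job.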