Writing a 'Chaos Monkey' to increase resilience

Apologies for the rather open nature of the question, but I think it's a very valuable area of discussion.

Following the recent AWS outage and the huge number of horror stories that followed it, I was really impressed by the Chaos Monkey 'technique' applied by Netflix (one of the few to survive pretty much without a scratch).

For those who don't know the concept, it is essentially a little bot that goes around your infrastructure, causing chaos along the way, as a way of continuously testing resilience.
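To make the idea concrete, here is a minimal sketch of such a bot's core loop. The `Instance` class, the `terminate` callback, and the rule of never killing the last running member of a group are all assumptions made for illustration; this is not how Netflix's tool is implemented, and a real version would call your hosting provider's API inside `terminate`.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    """Hypothetical stand-in for a server; not a real cloud API object."""
    name: str
    group: str
    running: bool = True


def pick_victims(instances, rng, probability=0.5):
    """Choose at most one running instance per group to take down."""
    by_group = {}
    for inst in instances:
        if inst.running:
            by_group.setdefault(inst.group, []).append(inst)

    victims = []
    for group in sorted(by_group):
        members = by_group[group]
        # Assumption for this sketch: never kill the last running member,
        # so the test probes redundancy rather than causing a full outage.
        if len(members) > 1 and rng.random() < probability:
            victims.append(rng.choice(members))
    return victims


def unleash(instances, terminate, rng=None, probability=0.5):
    """Run one round of chaos and return the instances that were terminated."""
    if rng is None:
        rng = random.Random()
    victims = pick_victims(instances, rng, probability)
    for victim in victims:
        terminate(victim)  # supplied by the caller
    return victims
```

Run on a schedule, something like this forces every service group to prove regularly that it can lose a member and keep going.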

Besides Jeff Atwood's Chaos Monkey post, I've been able to find little on this being employed anywhere else.

Whilst I appreciate that good test-driven development is a solid foundation, I think that this would be a great addition to the arsenal of any company/organisation that wants to stay up.

  • Has anyone else approached this topic before?
  • Are there particular areas, other than connectivity and security vulnerabilities, that you would see such a piece of code hitting?
  • Any other thoughts/feelings on this approach?
isNaN1247 asked May 13 '11 20:05


1 Answer

There are several tests you could do to stress your system. I like to use Apache Bench (ab) to load test a page that writes to the database. I test it for both the total number of hits and the number of concurrent users.

500 concurrent users making a total of 5000 requests
$ ab -n 5000 -c 500 url

I know my webserver can stand up to this, but the test exposed a problem with how I was logging information. You could point the same test at different aspects of your site.
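If you want the same kind of test from a script, so that you can inject other failures while it runs, a rough Python equivalent might look like the sketch below. It uses only the standard library; the function names `fetch_status` and `load_test` are my own, and it is far less capable than ab.

```python
import time
import urllib.error
import urllib.request
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def fetch_status(url, timeout=10.0):
    """Return the HTTP status for one GET, or the exception name on failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # 4xx/5xx responses still count as an outcome
    except Exception as exc:
        return type(exc).__name__  # timeouts, refused connections, ...


def load_test(url, total=5000, concurrency=500, fetch=fetch_status):
    """Issue `total` requests with up to `concurrency` in flight at once."""
    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        outcomes = Counter(pool.map(fetch, [url] * total))
    elapsed = time.perf_counter() - started
    return outcomes, elapsed

# Example call (the URL is a placeholder for your own test page):
#   outcomes, seconds = load_test("http://localhost:8000/write", 5000, 500)
```

The returned counter groups results by status code or error name, which makes it easy to spot a handful of 500s hiding among thousands of 200s.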

If you use caching you could clear the cache in the middle of the testing to see that everything recovers quickly.
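As a sketch of what "recovers" could mean in a test, the snippet below warms a cache, clears it as the injected fault, and checks that reads still return the same values while the cache refills. `ReadThroughCache` is a toy in-memory class written for this example, not the API of any particular caching product.

```python
class ReadThroughCache:
    """Toy read-through cache: on a miss, load from the backing store."""

    def __init__(self, loader):
        self._loader = loader  # function that fetches a value from the real store
        self._store = {}
        self.misses = 0

    def get(self, key):
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._loader(key)
        return self._store[key]

    def clear(self):
        self._store.clear()


def check_recovery_after_clear(cache, keys):
    """Clear the cache between reads and report whether results stayed consistent."""
    before = [cache.get(k) for k in keys]  # warm the cache
    cache.clear()                          # the injected fault
    misses_at_clear = cache.misses
    after = [cache.get(k) for k in keys]   # should fall back to the loader
    refill_misses = cache.misses - misses_at_clear
    again = [cache.get(k) for k in keys]   # should be served from cache again
    fully_refilled = cache.misses - misses_at_clear == refill_misses
    return {
        "consistent": before == after == again,
        "refill_misses": refill_misses,
        "fully_refilled": fully_refilled,
    }
```

Against a real site you would do the clear while the load test is running and watch response times and error counts instead of a miss counter.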

If you can replicate your server in a VM, try changing the amount of RAM, unmounting a hard disk, running out of disk space, disconnecting a network interface, etc.
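Some of these faults can also be simulated in code before you break a VM. The sketch below fakes a full disk with a file-like object whose writes fail, and checks that a logging helper degrades gracefully instead of crashing the request. `FullDisk` and `log_event` are names invented for this example.

```python
import errno
import io


class FullDisk(io.StringIO):
    """File-like object whose writes fail as if the disk were full."""

    def write(self, text):
        raise OSError(errno.ENOSPC, "No space left on device")


def log_event(stream, message, fallback):
    """Write a log line; on an I/O error, keep it in memory instead of raising."""
    try:
        stream.write(message + "\n")
        return True
    except OSError:
        fallback.append(message)  # e.g. flush later or send elsewhere
        return False
```

Swapping `FullDisk()` in for the real log file in a test shows quickly whether a write failure takes the whole request down with it.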

You could try to brute force a password and make sure your system only allows n login attempts before rate limiting that user.
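Here is a minimal sketch of that check, assuming a sliding-window limit on failed logins per username. `LoginThrottle`, `brute_force`, and the `check_password` callback are illustrative names, not part of any framework; in a real test you would drive your actual login endpoint instead.

```python
import time


class LoginThrottle:
    """Allow at most `max_attempts` failed logins per user within a time window."""

    def __init__(self, max_attempts=5, window_seconds=300.0, clock=time.monotonic):
        self._max = max_attempts
        self._window = window_seconds
        self._clock = clock
        self._failures = {}  # username -> timestamps of recent failures

    def allow(self, username):
        now = self._clock()
        recent = [t for t in self._failures.get(username, [])
                  if now - t < self._window]
        self._failures[username] = recent  # drop failures outside the window
        return len(recent) < self._max

    def record_failure(self, username):
        self._failures.setdefault(username, []).append(self._clock())

    def record_success(self, username):
        self._failures.pop(username, None)


def brute_force(throttle, username, guesses, check_password):
    """Try guesses in order; return (guesses actually checked, success flag)."""
    tried = 0
    for guess in guesses:
        if not throttle.allow(username):
            break  # the limiter stepped in
        tried += 1
        if check_password(username, guess):
            throttle.record_success(username)
            return tried, True
        throttle.record_failure(username)
    return tried, False
```

If `tried` ever exceeds the configured limit within one window, the rate limiting is not doing its job.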

mcotton answered Oct 14 '22 04:10