I have a site running on amazon elastic beanstalk with the following traffic pattern:
Amazon web services claim to be able to rapidly scale to challenges like this but the "Greater than x for more than 1 minute" setup of cloudwatch doesn't appear to be fast enough for this traffic pattern?
Usually within seconds all the ec2 instances crash, killing all cloudwatch metrics and the whole site is down for 4/6 minutes. So far I've yet to find a configuration that works for this senario.
Here is the graph of a smaller event that also killed the site:
Are these links posted predictably? If so, you can use Scaling by Schedule or as alternative you might change DESIRED-CAPACITY value of Auto Scaling Group or even trigger as-execute-policy
to scale out straight before your link is posted.
Do you know you can have multiple scaling policies in one group? So you might have special Auto Scaling policy for your case, something like SCALE_OUT_HIGH
which adds say 10 more instances at once. Take a look at as-put-scaling-policy
command.
Also, you need to check your code and find bottle necks.
What HTTPD do you use? Consider of switching to Nginx as it's much more faster and less resource consuming software than Apache. Try to use Memcache... NoSQL like Redis for hight read and writes is fine option as well.
The suggestion from AWS was as follows:
We are always working to make our systems more responsive, but it is challenging to provision virtual servers automatically with a response time of a few seconds as your use case appears to require. Perhaps there is a workaround that responds more quickly or that is more resilient when requests begin to increase.
Have you observed whether the site performs better if you use a larger instance type or a larger number of instances in the steady state? That may be one method to be resilient to rapid increases in inbound requests. Although I recognize it may not be the most cost-effective, you may find this to be a quick fix.
Another approach may be to adjust your alarm to use a threshold or a metric that would reflect (or predict) your demand increase sooner. For example, you might see better performance if you set your alarm to add instances after you exceed 75 or 100 users. You may already be doing this. Aside from that, your use case may have another indicator that predicts a demand increase, for example a posting on your Facebook page may precede a significant request increase by several seconds or even a minute. Using CloudWatch custom metrics to monitor that value and then setting an alarm to Auto Scale on it may also be a potential solution.
So I think the best answer is to run more instances at lower traffic and use custom metrics to predict traffic from an external source. I am going to try, for example, monitoring Facebook and Twitter for posts with links to the site and scaling up straight away.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With