Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Web application monitoring best practices [closed]

People also ask

How do you monitor the performance of a web application?

The best way to do application availability monitoring is with a simple HTTP ping monitor that runs every minute. For example, we use this at Stackify to monitor our various web applications and marketing websites. We can monitor the response time and ensure that they are responding with an HTTP status code of 200.


Nagios is good, it's good to maybe have system testing (Selenium) running regularily.

Edit: Hyperic and Groundwork also look interesting.

There is probably a test suite system that can keep pressure testing everything as well for you. I can't remember the name off the top of my head, maybe someone can mention one below.

Other things I like to do:

The best motto for infrastructure is always fix, detect, repair. Get it up, get to the root of it, and cure/prevent it if you can.

Since a system exists at many levels, we should test at many levels:

Edit: Have all errors or warnings posted directly to your case manager via email. That way you can track occurrences in one place.

1) Connection : monitor your internet connectivity from the server and from the outside. Log this somewhere

2) Server : monitor all the processes that you need to to ensure they are running and not pinning the server. Use a HP Server or something equivalent with hardware failure notification that it can do from a bios level. Notify and log if they are.

3) Software : Identify the key software that always needs to be running. Set the performance levels if any and then monitor them. Nagios should be able to help with this. On windows it can be a bit more. When an exception occurs, you should be able to run a script from it to restart processes automatically. My dream system is allowing me to interact with servers via SMS if the server sees it as an exception that I have to either permit, or one that will happen automatically unless I cancel by sms. One day..

4) Remote Power : Ensure Remote power-reset capabilities are in your hand. You might want to schedule weekly reboots if you ever use windows for anything.

5) Business Logic Testing : Have regularly running scripts testing the workflow of your system. Selenium can probably achieve some of this, but I like logging the results as well to say this ran at this time and these files had errors. If possible anywhere, have the system monitor itself through your scripts.

6) Backups : Make a backup that you can set and forget. If you can get things into virtual machines it would be ideal as you can scale, move, or deploy any part of your infrastructure anywhere. I have had instances where I moved a dead server onto my laptop, let it run in vmware while I fixed a problem.


Monitoring the number of connections to your Web server and your database is another good thing to track. Chances are if one shoots through the roof, something is starving for resources and the site is about to go down.

Also make sure you have a regular request for a URL that is a reasonable end-to-end test of the system. If your site supports search, then have nagios execute a search - that should make sure the search index is healthy, the Web server and the database server.

Also, make sure that your applications sends you email anytime your users see an error, or there is an unhandled exception. That way you know how the application is failing in the field.


If I had to pick one type of testing it would be to test the end-user functionality of the system. The important thing to consider is the user. While testing things like database availability, server up-time, etc, are all important, testing work-flows through your system via a remote UI testing system covers all these bases. If you know that the critical parts of your system are available to the end-user, then you know your system is prolly Ok.

  1. Identify the important work-flows in your system. For example, if you wrote an eCommerce site you might identify a work-flow of "search for a product, put product in shopping cart, and purchase product".
  2. Prioritize the work-flows, and build out higher-priority tests first. You can always add additional tests after you roll out to production.
  3. Build UI tests using one of the available UI testing frameworks. There are a number of free and commercial UI testing frameworks that can be run in an automated fashion. Build a core set of tests first that address critical work-flows.
  4. Setup at least one remote location from which to run tests. You want to test every aspect of your system, which means testing it remotely. Is the internet connection up? Is the web server running? Is the connection to the database server working? Etc, etc. If you test remotely you make sure you system is available to the outside world which means it is most likely working end-to-end. You can also run these tests internally, but I think it is critical to run them externally.
  5. Make sure your solution includes both reporting and notification. If one of your critical work-flow tests fails, you want someone to know about it to fix the problem ASAP. If a non-critical task fails, perhaps you only want reporting so that you can fix problems out-of-band.

This end-user testing should not eliminate monitoring of system in your data-center, but I want to reiterate that end-user testing is the most important type of testing you can do for a web application.


Ahhh, monitoring. How I love thee and your vibrations at 3am.

Essentially, you need a way to inspect the internal state of your application, both at a specific moment, as well as over spans of time (the latter is very important for detecting problems before they occur). Another way to think of it is as glorified unit-testing.

We have our own (very nice) monitoring system, so I can't comment on Nagios or other apps. Our use case is similar to yours, though (cgi app on apache).

  1. Add a logging.monitor() type method, which will log information to disk. This should support, at the least, logging simple numbers and dicts of numbers (the key=>value association can be incredibly handy).
  2. Have a process that scrapes the monitoring logs and stores them into a database.
  3. Have a process that takes the database information, checks them against rules, and sends out alerts. Keep in mind that somethings can be flaky. Just because you got a 404 once doesn't mean the app it down.
  4. Have a way to mute alerts (very useful for maintenance or to read your email).

Thats all pretty high level. The important thing is that you have a history of the state of the application over time. From this, you can then create rules (perhaps just raw sql queries you put into a config somewhere), that say "If the queries per second doubled, send a SlashDotted alert", or "if 50% of responses are 404, send an alert". It also bedazzles management because you can quantify any comment about whether its up, down, fast, or slow.

Things to monitor include (others probably mentioned these as well): http status, port accessible, http load, database load, open connection, query latency, server accessibility (ssh, ping), queries per second, number of worker processes, error percentage, error rate.

Simple end-to-end tests are also very handy, though they can be brittle. Its best to keep them simple, but you should have one that tries to touch core pieces of the app (caching, database, authentication).