
Managing a Large Number of Log Files Distributed Over Many Machines

We have started using a third-party platform (GigaSpaces) that helps us with distributed computing. One of the major problems we are trying to solve now is how to manage our log files in this distributed environment. Our current setup is as follows.

Our platform is distributed over 8 machines. On each machine we have 12-15 processes that log to separate log files using java.util.logging. On top of this platform we have our own applications that use log4j and log to separate files. We also redirect stdout to a separate file to catch thread dumps and similar.

This results in about 200 different log files.

As of now we have no tooling to assist in managing these files. This causes us serious headaches in the following cases.

  • Troubleshooting when we do not know beforehand in which process the problem occurred. In this case we currently log in to each machine over ssh and start grepping.

  • Trying to be proactive by regularly checking the logs for anything out of the ordinary. Here too we currently log in to all the machines and inspect the different logs with less and tail.

  • Setting up alerts. We want to be alerted when events exceed a threshold, which looks like a pain with 200 log files to check.

Today we have only about 5 log events per second, but that will increase as we migrate more and more code to the new platform.

I would like to ask the community the following questions.

  • How have you handled similar cases with many log files distributed over several machines logged through different frameworks?
  • Why did you choose that particular solution?
  • How did your solutions work out? What did you find good and what did you find bad?

Many thanks.

Update

We ended up evaluating a trial version of Splunk. We are very happy with how it works and have decided to purchase it. It is easy to set up, searches are fast, and it has a ton of features for the technically inclined. I can recommend that anyone in a similar situation check it out.

Asked by K Erlandsson on Oct 25 '10



1 Answer

I would recommend piping all your Java logging through the Simple Logging Facade for Java (SLF4J) and then redirecting everything from SLF4J to Logback. SLF4J has special bridges for all the popular legacy APIs (log4j, commons-logging, java.util.logging, etc.); see the SLF4J documentation on legacy bridging.
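Concretely, the bridging works by putting the SLF4J bridge artifacts on the classpath in place of the legacy libraries. A sketch of the Maven dependencies (version numbers are illustrative, not prescriptive):

```xml
<!-- Routes log4j 1.x calls into SLF4J; replaces the real log4j jar. -->
<dependency>
  <groupId>org.slf4j</groupId>
  <artifactId>log4j-over-slf4j</artifactId>
  <version>1.7.36</version>
</dependency>
<!-- Routes commons-logging calls into SLF4J. -->
<dependency>
  <groupId>org.slf4j</groupId>
  <artifactId>jcl-over-slf4j</artifactId>
  <version>1.7.36</version>
</dependency>
<!-- Routes java.util.logging into SLF4J; unlike the others, this one
     also requires calling SLF4JBridgeHandler.install() at startup. -->
<dependency>
  <groupId>org.slf4j</groupId>
  <artifactId>jul-to-slf4j</artifactId>
  <version>1.7.36</version>
</dependency>
```

Note that the java.util.logging bridge is the odd one out: it needs a one-time `SLF4JBridgeHandler.install()` call (or an equivalent `logging.properties` entry) because JUL handlers are registered at runtime rather than resolved from the classpath.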

Once you have your logs in Logback you can use one of its many appenders to aggregate logs from several machines; for details, see the manual's section on appenders. Socket, JMS and SMTP seem to be the most obvious candidates.
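For example, a socket-based setup could ship every machine's events to one central collector. A minimal `logback.xml` sketch, assuming a hypothetical collector host and port:

```xml
<configuration>
  <!-- Ships serialized logging events to a central log server.
       Host name and port here are placeholders for your environment. -->
  <appender name="CENTRAL" class="ch.qos.logback.classic.net.SocketAppender">
    <remoteHost>logserver.example.com</remoteHost>
    <port>4560</port>
    <!-- Retry the connection if the collector goes down (milliseconds). -->
    <reconnectionDelay>10000</reconnectionDelay>
  </appender>

  <root level="INFO">
    <appender-ref ref="CENTRAL" />
  </root>
</configuration>
```

On the collector side you would run a receiver (Logback ships a `SimpleSocketServer` for this purpose) that writes the merged stream to disk, so grep only has one place to look.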

Logback also has built-in support for monitoring log events for special conditions and for filtering which events reach a particular appender. So you could set up an SMTP appender to e-mail you every time an ERROR-level event appears in the logs.
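A sketch of such an alerting appender, again with placeholder mail settings; the `ThresholdFilter` keeps anything below ERROR out of the e-mails:

```xml
<!-- Mails a digest of recent events whenever an ERROR is logged.
     Host and addresses are hypothetical examples. -->
<appender name="ALERTS" class="ch.qos.logback.classic.net.SMTPAppender">
  <smtpHost>mail.example.com</smtpHost>
  <to>ops@example.com</to>
  <from>logback@example.com</from>
  <subject>ERROR in %logger{20}</subject>
  <layout class="ch.qos.logback.classic.PatternLayout">
    <pattern>%date %level [%thread] %logger - %msg%n</pattern>
  </layout>
  <!-- Only ERROR and above should trigger mail. -->
  <filter class="ch.qos.logback.classic.filter.ThresholdFilter">
    <level>ERROR</level>
  </filter>
</appender>
```

By default the SMTP appender buffers recent events and includes them in the mail, so you get some surrounding context, not just the single ERROR line.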

Finally, to ease troubleshooting, be sure to attach some sort of request ID to all your incoming "requests"; see my answer to this question for details.
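With SLF4J/Logback the usual mechanism for this is the MDC (`MDC.put("requestId", ...)` plus `%X{requestId}` in the log pattern). The stdlib-only sketch below illustrates the same idea with a ThreadLocal; the class and method names are hypothetical:

```java
import java.util.UUID;
import java.util.logging.Level;
import java.util.logging.Logger;

public class RequestIdDemo {
    private static final Logger LOG = Logger.getLogger(RequestIdDemo.class.getName());

    // Holds the id of the request the current thread is handling.
    private static final ThreadLocal<String> REQUEST_ID = new ThreadLocal<>();

    static String currentRequestId() {
        return REQUEST_ID.get();
    }

    // Tag every piece of work done for one request with the same id,
    // so all its log lines can be grepped together afterwards.
    static void handleRequest(Runnable work) {
        REQUEST_ID.set(UUID.randomUUID().toString());
        try {
            work.run();
        } finally {
            REQUEST_ID.remove(); // avoid leaking ids across pooled threads
        }
    }

    public static void main(String[] args) {
        handleRequest(() ->
            LOG.log(Level.INFO, "[req={0}] processing order", currentRequestId()));
    }
}
```

Once every log line carries the id, tracing one request across 200 files reduces to a single grep for that id.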

EDIT: you could also implement your own custom Logback appender and redirect all logs to Scribe.

Answered by Neeme Praks on Oct 03 '22