
Log files in massively distributed systems

I do a lot of work in the grid and HPC space, and one of the biggest challenges we have with a system distributed across hundreds (or in some cases thousands) of servers is analysing the log files.

Currently, log files are written locally to the disk on each blade, but we could also consider publishing logging information using, for example, a UDP Appender and collecting it centrally.

Given that the objective is to be able to identify problems in as close to real time as possible, what should we do?

John Channing asked Aug 29 '08 21:08


1 Answer

First, synchronize all clocks in the system using NTP.

Second, if you are collecting the logs in a single location (as with the UDP appender you mention), make sure each entry carries enough information to actually help. I would include at least the server that generated the entry, the time it happened, and the message. If there is any sort of transaction ID or job ID concept, include that as well.
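As a rough illustration of the fields listed above, here is a minimal sketch in Python's standard `logging` module rather than log4j. The class name `CentralLogFormatter`, the helper `make_record`, the `job_id` attribute, and the host name `blade-017` are all hypothetical names chosen for the example; the JSON layout is just one possible wire format, not something the original setup prescribes.

```python
import json
import logging
import socket
from datetime import datetime, timezone


class CentralLogFormatter(logging.Formatter):
    """Render each record as one JSON line carrying host, time, and message."""

    def __init__(self, host=None):
        super().__init__()
        # Name of the server that produced the entry (defaults to this machine).
        self.host = host or socket.gethostname()

    def format(self, record):
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "host": self.host,
            # UTC timestamp, so entries from different servers can be merged.
            "time": created.isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
            # Hypothetical job identifier; None when the caller set none.
            "job_id": getattr(record, "job_id", None),
        }
        return json.dumps(entry)


def make_record(message, job_id=None):
    """Build a LogRecord by hand so the formatter can be tried on its own."""
    record = logging.LogRecord(
        name="grid",
        level=logging.INFO,
        pathname="example",
        lineno=0,
        msg=message,
        args=None,
        exc_info=None,
    )
    record.job_id = job_id
    return record


formatter = CentralLogFormatter(host="blade-017")
line = formatter.format(make_record("task failed", job_id="job-42"))
print(line)
```

In a real deployment the formatter would be attached to whatever handler ships entries to the central collector; building the record by hand here only keeps the sketch self-contained.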

Since you mentioned a UDP Appender, I am guessing you are using log4j (or one of its siblings). Log4j has an MDC (Mapped Diagnostic Context) class that allows extra information to be passed along through a processing thread. It can help collect some of that extra information and pass it along with each log message.

John Meagher answered Jan 01 '23 19:01