I do a lot of work in the grid and HPC space and one of the biggest challenges we have with a system distributed across hundreds (or in some case thousands) of servers is analysing the log files.
Currently log files are written locally to the disk on each blade but we could also consider publishing logging information using for example a UDP Appender and collect it centally.
Given that the objective is to be able to identify problems in as close to real time as possible, what should we do?
First, synchronize all clocks in the system using NTP.
Second, if you are collecting the logs in a single location (like the UDP appender you mention) make sure the logs have enough information to actually help. I would include at least the server that generated the log, the time it happened, and the message. If there is any sort of transaction id, or job id type concept, include that also.
Since you mentioned a UDP Appender I am guessing you are using log4j (or one of it's siblings). Log4j has an MDC class that allows extra information to be passed along through a processing thread. it can help collect some of the extra information and pass it along.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With