Log files in massively distributed systems

Question

I do a lot of work in the grid and HPC space, and one of the biggest challenges we have with a system distributed across hundreds (or in some cases thousands) of servers is analysing the log files.

Currently, log files are written locally to the disk on each blade, but we could also consider publishing logging information using, for example, a UDP Appender and collecting it centrally.

Given that the objective is to be able to identify problems in as close to real time as possible, what should we do?

John Meagher · Accepted Answer

First, synchronize all clocks in the system using NTP.

Second, if you are collecting the logs in a single location (like the UDP appender you mention), make sure the logs have enough information to actually help. I would include at least the server that generated the log, the time it happened, and the message. If there is any sort of transaction ID or job ID concept, include that too.
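As a rough illustration of the fields listed above, here is a minimal sketch using only the Java standard library. The class name, the line layout, the job id "job-42", and the collector address localhost:9999 are all assumptions made for the example; a real setup would more likely let the logging framework's own appender do the sending.

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;
import java.time.Instant;

public class UdpLogSketch {
    // Builds one self-describing line: UTC timestamp, originating host, job id, message.
    public static String format(Instant time, String host, String jobId, String message) {
        return time + " host=" + host + " job=" + jobId + " msg=" + message;
    }

    // Sends the line as a single UDP datagram. UDP gives no delivery guarantee,
    // so lines can be lost or arrive out of order at the collector.
    public static void send(String line, String collectorHost, int collectorPort) throws Exception {
        byte[] payload = line.getBytes(StandardCharsets.UTF_8);
        try (DatagramSocket socket = new DatagramSocket()) {
            socket.send(new DatagramPacket(payload, payload.length,
                    InetAddress.getByName(collectorHost), collectorPort));
        }
    }

    public static void main(String[] args) throws Exception {
        String host = InetAddress.getLocalHost().getHostName();
        String line = format(Instant.now(), host, "job-42", "task started");
        System.out.println(line);
        send(line, "localhost", 9999); // hypothetical collector address
    }
}
```

Because the timestamp is taken on the sending server, the ordering of lines from different servers is only as good as the clock synchronization between them.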

Since you mentioned a UDP Appender, I am guessing you are using log4j (or one of its siblings). Log4j has an MDC (Mapped Diagnostic Context) class that allows extra information to be passed along through a processing thread. It can help collect some of the extra information and pass it along.
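To show the idea behind MDC without depending on log4j, here is a small stand-in built on a ThreadLocal map. This is not log4j's implementation; the class and method names below are made up for the sketch. With log4j 1.x itself, the equivalent would be roughly a call to org.apache.log4j.MDC.put("jobId", value) in the worker thread and a %X{jobId} conversion in the PatternLayout.

```java
import java.util.HashMap;
import java.util.Map;

public class ThreadContextSketch {
    // One map per thread, so concurrent jobs do not overwrite each other's values.
    private static final ThreadLocal<Map<String, String>> CONTEXT =
            ThreadLocal.withInitial(HashMap::new);

    public static void put(String key, String value) { CONTEXT.get().put(key, value); }
    public static String get(String key) { return CONTEXT.get().get(key); }
    public static void clear() { CONTEXT.get().clear(); }

    // Prefixes the message with the current thread's job id, if one has been set.
    public static String decorate(String message) {
        String jobId = get("jobId");
        return (jobId == null ? "" : "job=" + jobId + " ") + message;
    }

    public static void main(String[] args) throws InterruptedException {
        Runnable work = () -> {
            // Hypothetical job id: the thread name stands in for a real one.
            put("jobId", Thread.currentThread().getName());
            try {
                System.out.println(decorate("processing"));
            } finally {
                clear(); // pooled threads are reused, so stale values must not leak
            }
        };
        Thread a = new Thread(work, "job-1");
        Thread b = new Thread(work, "job-2");
        a.start();
        b.start();
        a.join();
        b.join();
    }
}
```

The point of the pattern is that code deep inside the processing thread can log without being handed the job id explicitly; the id is set once at the start of the work and picked up by every message that thread emits.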


Tags:

distributed-computing

hpc

John Channing

