Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Risk of missing events from ETW logging with EventSource

Tags:

I'm instrumenting my .NET 4.5 applications to emit ETW events using the EventSource class. The goal is to be able to capture some of these events (the Error level events) for error logging.

After doing some reading and testing, I am concerned about the reliability of this approach to error logging, specifically regarding the possibility of dropped or missing events. If my error logging isn't working I need the application to shut down (in my case it is unsafe for it to run with unreported errors). When using ETW and EventSource, how can I be certain that my errors are getting properly recorded?

Obviously part of the answer will depend on what is listening to the events. In my case I plan to use the "Semantic Logging Application Block" from the latest MS Enterprise Library.

Here is one source where Microsoft talks about the possible causes of missed events: About Event Tracing

There they list these possible causes of Missing Events

  • The total event size is greater than 64K. This includes the ETW header plus the data or payload. A user has no control over these missing events since the event size isconfigured by the application.

  • The ETW buffer size is smaller than the total event size. A user has no control over these missing events since the event size is configured by the application logging the events.

  • For real-time logging, the real-time consumer is not consuming events fast enough or is not present altogether and then the backing file is filling up. This can result if the Event Log service is stopped and started when events are being logged. A user has no control over these missing events.

  • When logging to a file, the disk is too slow to keep up with the logging rate.

To see if these concerns were somehow mitigated using the EventSource class (like, does it truncate large payloads somehow) I did some testing. I tried to log long strings and it failed for me between 30,000 and 35,000 characters (right in line with the 64KB maximum event payload). It just silently does nothing from what I can tell for the too-large strings, no events at all in my Semantic Logging Application Block log. Events before and after were written like usual.

So anytime I have a string in my payload I have to pass it through some truncator? Will I need to manually avoid generating events "too fast" as well (and how would that be possible)?

Microsoft Patterns and Practices are supposed to lead us to good ... patterns and practices ... so maybe I'm just missing something here.

Update:

Well apparently there is some notice in the consuming application for the "Events too fast" condition. I received this today for the first time:

Level : Warning, Message : Some events will be lost because of buffer overruns or schema synchronization delays in trace session: Microsoft-SemanticLogging-Etw-svcRuntime

And then when closing the session:

Level : Warning, Message : The loss of 1 events was detected in trace session 'Microsoft-SemanticLogging-Etw-svcRuntime'.

Update2:

The Enterprise Library Developers Guide describes the behavior I just mentioned.

You should monitor the log messages generated by the Semantic Logging Application Block for any indication that the have buffers overflowed and that you have lost messages. For example, log messages with event ids 900 and 901 indicate that a sink’s internal buffers have overflowed; in the out-of-process scenario, event ids 806 and 807 indicate that the ETW buffers have overflowed. You can modify the buffering configuration options for the sinks to reduce the chance that the buffers overflow with your typical workloads.

My question remains, can I use semantic logging while ensuring my application does not run if errors are being dropped? Normal tracing events could be dropped ...

My current thought is to log "critical" errors with a separate class using old-fashioned logging techniques, and keep less critical errors (as well as debug type events) going through the ETW pipeline. That wouldn't be too bad really ... I might post that as a solution if I can't find a better suggestion.

Update 3:

The "missing events" warning that I received had nothing to do with buffer overruns, turns out this is the message you get if you pass a null string as a payload value.

like image 727
TCC Avatar asked Jun 22 '13 00:06

TCC


People also ask

Which files will you use to know the ETW events generated by any process?

etl file to generate traces that can be read by the Service Trace Viewer. This file can be found in the C:\logs folder.

What is an event trace log?

Event Tracing for Windows (ETW) is an efficient kernel-level tracing facility that lets you log kernel or application-defined events to a log file. You can consume the events in real time or from a log file and use them to debug an application or to determine where performance issues are occurring in the application.


2 Answers

The EventSource class comes in two versions, one included with the .NET Framework and another in the NuGet package Microsoft EventSource Library. I assume that you use the NuGet package because it contains newer code.

The constructor for the EventSource base class has an overload that takes a boolean argument throwOnEventWriteErrors with the following documentation (NuGet package version 1.0.26.0):

By default calling the 'WriteEvent' methods do NOT throw on errors (they silently discard the event). This is because in most cases users assume logging is not 'precious' and do NOT wish to have logging failures crash the program. However for those applications where logging is 'precious' and if it fails the caller wishes to react, setting 'throwOnEventWriteErrors' will cause an exception to be thrown if WriteEvent fails. Note the fact that EventWrite succeeds does not necessarily mean that the event reached its destination only that operation of writing it did not fail.

Unfortunately, the last sentence contains a caveat emptor but if you look into the source code for the EventSource you can see that the underlying return codes from the OS calls are used to throw different exceptions for NoFreeBuffers and EventTooBig (and other errors).

So if you turn on throwOnEventWriteErrors you will get exceptions if the EventSource class is unable to deliver the event to ETW. However, if ETW fails for another reason you wont get any exception but if you ensure that your ETW channels are configured correctly that should rarely if ever happen. However, as you cannot tolerate losing any error events you should probably test extreme error cases to make sure that ETW behaves as you expect.

like image 137
Martin Liversage Avatar answered Sep 21 '22 09:09

Martin Liversage


One thing that there are two important points that are not made clear in the discussion above.

  1. ALL the issues associated with dropped events have to do with ETW (Event Tracing for Windows), not EventSource. That is logically EventSOurces talk to EventListeners, and there is built-in listener that forwards to ETW. Obviously when you talk about dropped events, the constraint of ANY link in the chain will affect data flowing through the chain. Thus one way of guaranteeing complete reliability is to use an EventListener that does not use ETW but goes directly to wherever you want the data to go. I believe that the (Semantic Logging Application Block) has such listener.

  2. ETW has been successfully used to reliably forward events, but you do have to live within the constraints mentioned above (the size of the events has to be kept < 64K and you have to keep the event rate under control. Note that if the rate is too high you will know this because the WriteEvent will fail, so you can retry (after pausing), and thus make something that fully reliable (at the expense of slowing down the program). Note that this kind of data loss is just not an interesting issue if you are truly talking about errors (which should not be happening at a huge rate, and if they are happening at a high rate, they are likely to be redundant (the same thing firing rapidly).

So in conclusion EventSource supports reliable events by default, ETW does not support it by default but can be made to support it, but often, ETW's defaults are more than fine.

like image 26
Vance Morrison Avatar answered Sep 20 '22 09:09

Vance Morrison