Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Implementing an async "read all currently available data from stream" operation

I recently provided an answer to this question: C# - Realtime console output redirection.

As often happens, explaining stuff (here "stuff" was how I tackled a similar problem) leads you to greater understanding and/or, as is the case here, "oops" moments. I realized that my solution, as implemented, has a bug. The bug has little practical importance, but it has an extremely large importance to me as a developer: I can't rest easy knowing that my code has the potential to blow up.

Squashing the bug is the purpose of this question. I apologize for the long intro, so let's get dirty.

I wanted to build a class that allows me to receive input from a console's standard output Stream. Console output streams are of type FileStream; the implementation can cast to that, if needed. There is also an associated StreamReader already present to leverage.

There is only one thing I need to implement in this class to achieve my desired functionality: an async "read all the data available this moment" operation. Reading to the end of the stream is not viable because the stream will not end unless the process closes the console output handle, and it will not do that because it is interactive and expecting input before continuing.

I will be using that hypothetical async operation to implement event-based notification, which will be more convenient for my callers.

The public interface of the class is this:

public class ConsoleAutomator {
    public event EventHandler<ConsoleOutputReadEventArgs> StandardOutputRead;

    public void StartSendingEvents();
    public void StopSendingEvents();
}

StartSendingEvents and StopSendingEvents do what they advertise; for the purposes of this discussion, we can assume that events are always being sent without loss of generality.

The class uses these two fields internally:

    protected readonly StringBuilder inputAccumulator = new StringBuilder();

    protected readonly byte[] buffer = new byte[256];

The functionality of the class is implemented in the methods below. To get the ball rolling:

    public void StartSendingEvents();
    {
        this.stopAutomation = false;
        this.BeginReadAsync();
    }

To read data out of the Stream without blocking, and also without requiring a carriage return char, BeginRead is called:

    protected void BeginReadAsync()
    {
        if (!this.stopAutomation) {
            this.StandardOutput.BaseStream.BeginRead(
                this.buffer, 0, this.buffer.Length, this.ReadHappened, null);
        }
    }

The challenging part:

BeginRead requires using a buffer. This means that when reading from the stream, it is possible that the bytes available to read ("incoming chunk") are larger than the buffer. Remember that the goal here is to read all of the chunk and call event subscribers exactly once for each chunk.

To this end, if the buffer is full after EndRead, we don't send its contents to subscribers immediately but instead append them to a StringBuilder. The contents of the StringBuilder are only sent back whenever there is no more to read from the stream.

    private void ReadHappened(IAsyncResult asyncResult)
    {
        var bytesRead = this.StandardOutput.BaseStream.EndRead(asyncResult);
        if (bytesRead == 0) {
            this.OnAutomationStopped();
            return;
        }

        var input = this.StandardOutput.CurrentEncoding.GetString(
            this.buffer, 0, bytesRead);
        this.inputAccumulator.Append(input);

        if (bytesRead < this.buffer.Length) {
            this.OnInputRead(); // only send back if we 're sure we got it all
        }

        this.BeginReadAsync(); // continue "looping" with BeginRead
    }

After any read which is not enough to fill the buffer (in which case we know that there was no more data to be read during the last read operation), all accumulated data is sent to the subscribers:

    private void OnInputRead()
    {
        var handler = this.StandardOutputRead;
        if (handler == null) {
            return;
        }

        handler(this, 
                new ConsoleOutputReadEventArgs(this.inputAccumulator.ToString()));
        this.inputAccumulator.Clear();
    }

(I know that as long as there are no subscribers the data gets accumulated forever. This is a deliberate decision).

The good

This scheme works almost perfectly:

  • Async functionality without spawning any threads
  • Very convenient to the calling code (just subscribe to an event)
  • Never more than one event for each time data is available to be read
  • Is almost agnostic to the buffer size

The bad

That last almost is a very big one. Consider what happens when there is an incoming chunk with length exactly equal to the size of the buffer. The chunk will be read and buffered, but the event will not be triggered. This will be followed up by a BeginRead that expects to find more data belonging to the current chunk in order to send it back all in one piece, but... there will be no more data in the stream.

In fact, as long as data is put into the stream in chunks with length exactly equal to the buffer size, the data will be buffered and the event will never be triggered.

This scenario may be highly unlikely to occur in practice, especially since we can pick any number for the buffer size, but the problem is there.

Solution?

Unfortunately, after checking the available methods on FileStream and StreamReader, I can't find anything which lets me peek into the stream while also allowing async methods to be used on it.

One "solution" would be to have a thread wait on a ManualResetEvent after the "buffer filled" condition is detected. If the event is not signaled (by the async callback) in a small amount of time, then more data from the stream will not be forthcoming and the data accumulated so far should be sent to subscribers. However, this introduces the need for another thread, requires thread synchronization, and is plain inelegant.

Specifying a timeout for BeginRead would also suffice (call back into my code every now and then so I can check if there's data to be sent back; most of the time there will not be anything to do, so I expect the performance hit to be negligible). But it looks like timeouts are not supported in FileStream.

Since I imagine that async calls with timeouts are an option in bare Win32, another approach might be to PInvoke the hell out of the problem. But this is also undesirable as it will introduce complexity and simply be a pain to code.

Is there an elegant way to get around the problem?

Thanks for being patient enough to read all of this.

Update:

I definitely did not communicate the scenario well in my initial writeup. I have since revised the writeup quite a bit, but to be extra sure:

The question is about how to implement an async "read all the data available this moment" operation.

My apologies to the people who took the time to read and answer without me making my intent clear enough.

like image 801
Jon Avatar asked Dec 21 '10 23:12

Jon


1 Answers

In theory, I agree with Jason; your implementation has bigger problems than having a logic hole in the case of a chunk of data being evenly divisible by your buffer. The biggest problem I see is that your reader must have enough knowledge about the file type to know how it can separate data into "chunks" that your subscribers know how to deal with.

Streams have no inherent knowledge about what they're receiving or sending; only the mechanism by which they are transporting the data. A NetworkStream may be sending HTML or a ZIP file; a FileStream may be reading a text file or an MP3. It's the reader (XmlReader, TextReader, Image.FromStream(), etc) that has this knowledge. Therefore, your async reader has to know at least something about the data, but it would be useful not to have that knowledge hard-coded.

In order to work with "streaming" data, incremental sends must be individually useful; you must know enough about what you're getting to know that what you've gotten is a "chunk" that is individually processable. My suggestion is to provide that information to your async reader in an encapsulated fashion, either by having your subscribers tell you, or by providing some format-specific "chunkifier" seperate from the listeners (as this reader is listening to console output, and all listeners should treat it the same way, this second plan may be better).

A logical implementation:

public class MyStreamManager {
    public delegate bool ValidChunkTester(StringBuilder builder);

    private readonly List<ValidChunkTester> validators = new List<ValidChunkTester>();
    public event ValidChunkTester IsValidChunk
    { add{validators.Add(value);} remove {validators.Remove(value);}}

    public event EventHandler<ConsoleOutputReadEventArgs> StandardOutputRead;


    public void StartSendingEvents();
    public void StopSendingEvents();
}

...

private void ReadHappened(IAsyncResult asyncResult)
{
    var bytesRead = this.StandardOutput.BaseStream.EndRead(asyncResult);
    if (bytesRead == 0) {
        this.OnAutomationStopped();
        return;
    }

    var input = this.StandardOutput.CurrentEncoding.GetString(
        this.buffer, 0, bytesRead);
    this.inputAccumulator.Append(input);

    if (validators.Any() && StandardOutputRead !-= null 
            && validators.Aggregate(true, (valid, validator)=>valid && validator(inputAccumulator))) {
        this.OnInputRead(); // send when all listeners can work with the buffer contents
    }

    this.BeginReadAsync(); // continue "looping" with BeginRead
}

...

This model requires that subscribers not modify the StringBuilder; you can provide something immutable for them to examine if you choose. An example listener might be:

public bool IsACompleteLine(StringBuilder builder)
{
    return builder.Contains(Environment.NewLine);
}

or:

public bool Contains256Bytes(StringBuilder builder)
{
    return builder.Length >= 256;
}

... you get the idea. The event determining the worthiness of the current buffer to be released to listeners is conceptually separate from the listeners themselves, but doesn't have to be concretely so, so it will support either a single output-specific test or multiple listener-based tests.

like image 170
KeithS Avatar answered Sep 22 '22 02:09

KeithS