
Capturing binary output from Process.StandardOutput

Tags: c#, process, binary

In C# (.NET 4.0 running under Mono 2.8 on SuSE) I would like to run an external batch command and capture its output in binary form. The external tool I use is called 'samtools' (samtools.sourceforge.net) and, among other things, it can return records from an indexed binary file format called BAM.

I use Process.Start to run the external command, and I know that I can capture its output by redirecting Process.StandardOutput. The problem is, that's a text stream with an encoding, so it doesn't give me access to the raw bytes of the output. The almost-working solution I found is to access the underlying stream.

Here's my code:

    Process cmdProcess = new Process();
    ProcessStartInfo cmdStartInfo = new ProcessStartInfo();
    cmdStartInfo.FileName = "samtools";

    cmdStartInfo.RedirectStandardError = true;
    cmdStartInfo.RedirectStandardOutput = true;
    cmdStartInfo.RedirectStandardInput = false;
    cmdStartInfo.UseShellExecute = false;
    cmdStartInfo.CreateNoWindow = true;

    cmdStartInfo.Arguments = "view -u " + BamFileName + " " + chromosome + ":" + start + "-" + end;

    cmdProcess.EnableRaisingEvents = true;
    cmdProcess.StartInfo = cmdStartInfo;
    cmdProcess.Start();

    // Prepare to read each alignment (binary)
    var br = new BinaryReader(cmdProcess.StandardOutput.BaseStream);

    while (!cmdProcess.StandardOutput.EndOfStream)
    {
        // Consume the initial, undocumented BAM data
        br.ReadBytes(23);

        // ... more parsing follows

But when I run this, the first 23 bytes that I read are not the first 23 bytes of the output, but rather bytes from several hundred or thousand bytes downstream. I assume that StreamReader does some buffering, so the underlying stream has already advanced, say, 4K into the output. The underlying stream does not support seeking back to the start.

And I'm stuck here. Does anyone have a working solution for running an external command and capturing its stdout in binary form? The output may be very large, so I would like to stream it.

Any help appreciated.

By the way, my current workaround is to have samtools return the records in text format, then parse those, but this is pretty slow and I'm hoping to speed things up by using the binary format directly.

Asked Nov 10 '10 at 10:11 by Sten Linnarsson


2 Answers

Using StandardOutput.BaseStream is the correct approach, but you must not use any other property or method of cmdProcess.StandardOutput. For example, accessing cmdProcess.StandardOutput.EndOfStream will cause the StreamReader for StandardOutput to read part of the stream, removing the data you want to access.
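The effect is easy to reproduce without a child process. Below is a minimal sketch, assuming a MemoryStream stands in for the process's stdout pipe; the class and method names are made up for illustration. It only asks the StreamReader whether it is at end of stream, then reports how far the underlying stream has moved.

```csharp
using System;
using System.IO;

static class BufferingDemo
{
    // Hypothetical helper: wraps the data in a StreamReader, checks
    // EndOfStream once, and returns the position of the underlying stream.
    public static long PositionAfterEndOfStreamCheck(byte[] data, out bool atEnd)
    {
        var stream = new MemoryStream(data);
        var reader = new StreamReader(stream);

        // To answer this, the reader has to fill its internal buffer,
        // which consumes bytes from the underlying stream.
        atEnd = reader.EndOfStream;

        return stream.Position;
    }

    static void Main()
    {
        bool atEnd;
        long pos = PositionAfterEndOfStreamCheck(new byte[10000], out atEnd);
        Console.WriteLine("atEnd = {0}, base stream advanced by {1} bytes", atEnd, pos);
    }
}
```

The exact number of bytes consumed depends on the reader's buffer size and the runtime, so the sketch does not assume a particular value. With a non-seekable pipe those bytes cannot be recovered through BaseStream.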

Instead, simply read and parse the data from br (assuming you know how to parse the data, and won't read past the end of stream, or are willing to catch an EndOfStreamException). Alternatively, if you don't know how big the data is, use Stream.CopyTo to copy the entire standard output stream to a new file or memory stream.
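As a rough sketch of that approach (not the asker's actual parser), the code below touches only StandardOutput.BaseStream: it reads a fixed-size header and then drains the rest of the output with Stream.CopyTo. The names RawStdout, ReadFully and ReadHeader are hypothetical, and the command to run is taken from the command line rather than assumed. Stderr is deliberately not redirected here, since a redirected stderr that nobody reads can fill up and stall the child.

```csharp
using System;
using System.Diagnostics;
using System.IO;

static class RawStdout
{
    // Reads up to count bytes, looping because Stream.Read may return
    // fewer bytes than requested. Returns a shorter array if the stream ends.
    public static byte[] ReadFully(Stream stream, int count)
    {
        var buffer = new byte[count];
        int offset = 0;
        while (offset < count)
        {
            int n = stream.Read(buffer, offset, count - offset);
            if (n == 0) break; // end of stream
            offset += n;
        }
        if (offset < count) Array.Resize(ref buffer, offset);
        return buffer;
    }

    // Starts the command and returns the first headerLength raw bytes of its stdout.
    public static byte[] ReadHeader(string fileName, string arguments, int headerLength)
    {
        var info = new ProcessStartInfo
        {
            FileName = fileName,
            Arguments = arguments,
            RedirectStandardOutput = true,
            UseShellExecute = false,
            CreateNoWindow = true
        };

        using (Process process = Process.Start(info))
        {
            // Use only BaseStream; never EndOfStream, Peek, ReadLine, etc.
            Stream raw = process.StandardOutput.BaseStream;
            byte[] header = ReadFully(raw, headerLength);

            // Drain the remaining output so the child is not blocked on a full pipe.
            raw.CopyTo(Stream.Null);
            process.WaitForExit();
            return header;
        }
    }

    static void Main(string[] args)
    {
        if (args.Length < 1)
        {
            Console.WriteLine("usage: RawStdout <command> [arguments]");
            return;
        }
        string arguments = args.Length > 1 ? args[1] : "";
        byte[] header = ReadHeader(args[0], arguments, 23);
        Console.WriteLine("Read {0} header bytes", header.Length);
    }
}
```

In a real parser the Stream.Null target would be replaced by the parsing loop, or by a FileStream or MemoryStream if the whole output should be kept.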

Answered Sep 19 '22 at 15:09 by Bradley Grainger


Since you explicitly specified running on SUSE Linux and Mono, you can work around the problem by using native Unix calls to create the redirection and read from the stream, for example:

    using System;
    using System.Diagnostics;
    using System.IO;
    using Mono.Unix;

    class Test
    {
        public static void Main()
        {
            int reading, writing;
            Mono.Unix.Native.Syscall.pipe(out reading, out writing);
            int stdout = Mono.Unix.Native.Syscall.dup(1);
            Mono.Unix.Native.Syscall.dup2(writing, 1);
            Mono.Unix.Native.Syscall.close(writing);

            Process cmdProcess = new Process();
            ProcessStartInfo cmdStartInfo = new ProcessStartInfo();
            cmdStartInfo.FileName = "cat";
            cmdStartInfo.CreateNoWindow = true;
            cmdStartInfo.Arguments = "test.exe";
            cmdProcess.StartInfo = cmdStartInfo;
            cmdProcess.Start();

            Mono.Unix.Native.Syscall.dup2(stdout, 1);
            Mono.Unix.Native.Syscall.close(stdout);

            Stream s = new UnixStream(reading);
            byte[] buf = new byte[1024];
            int bytes = 0;
            int current;
            while((current = s.Read(buf, 0, buf.Length)) > 0)
            {
                bytes += current;
            }
            Mono.Unix.Native.Syscall.close(reading);
            Console.WriteLine("{0} bytes read", bytes);
        }
    }

Under Unix, file descriptors are inherited by child processes unless marked otherwise (close-on-exec). So, to redirect stdout of a child, all you need to do is change file descriptor 1 in the parent process before calling exec. Unix also provides a handy thing called a pipe, which is a unidirectional communication channel with two file descriptors representing its two endpoints. For duplicating file descriptors, you can use dup or dup2. Both create an equivalent copy of a descriptor, but dup returns a new descriptor allocated by the system, while dup2 places the copy in a specific target descriptor (closing the target first if it is open). What the above code does, then:

  1. Creates a pipe with endpoints reading and writing
  2. Saves a copy of the current stdout descriptor
  3. Makes stdout (descriptor 1) refer to the pipe's write endpoint, then closes the original write descriptor
  4. Starts the child process so it inherits stdout connected to the write endpoint of the pipe
  5. Restores the saved stdout
  6. Reads from the reading endpoint of the pipe by wrapping it in a UnixStream

Note, in native code, a process is usually started by a fork+exec pair, so the file descriptors can be modified in the child process itself, but before the new program is loaded. This managed version is not thread-safe as it has to temporarily modify the stdout of the parent process.

Since the code starts the child process without managed redirection, the .NET runtime does not change any descriptors or create any streams. So the only reader of the child's output is the user code, which uses a UnixStream to work around the StreamReader's encoding issue.

Answered Sep 21 '22 at 15:09 by Jester