Writing To A File With Multiple Streams C#

I am trying to download a large file (>1 GB) from one server to another over HTTP. To do this I am making HTTP range requests in parallel, so that different parts of the file download simultaneously.

When saving to disk, I take each response stream, open the same file as a FileStream, seek to the start of that response's range, and write.

However, I find that all but one of my response streams time out. It looks like the disk I/O cannot keep up with the network I/O. If I do the same thing but have each thread write to a separate file, it works fine.

For reference, here is my code writing to the same file:

int numberOfStreams = 4;
List<Tuple<int, int>> ranges = new List<Tuple<int, int>>();
string fileName = @"C:\MyCoolFile.txt";
Exception exception = null;
//List populated here
Parallel.For(0, numberOfStreams, (index, state) =>
{
    try
    {
        HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create("Some URL");
        // Request only the byte range this stream is responsible for
        webRequest.AddRange(ranges[index].Item1, ranges[index].Item2);
        using (Stream responseStream = webRequest.GetResponse().GetResponseStream())
        using (FileStream fileStream = File.Open(fileName, FileMode.OpenOrCreate, FileAccess.Write, FileShare.Write))
        {
            // Position the file stream at the start of this range
            fileStream.Seek(ranges[index].Item1, SeekOrigin.Begin);
            byte[] buffer = new byte[64 * 1024];
            int bytesRead;
            while ((bytesRead = responseStream.Read(buffer, 0, buffer.Length)) > 0)
            {
                if (state.IsStopped)
                {
                    return;
                }
                fileStream.Write(buffer, 0, bytesRead);
            }
        }
    }
    catch (Exception e)
    {
        exception = e;
        state.Stop();
    }
});

And here is the code writing to multiple files:

int numberOfStreams = 4;
List<Tuple<int, int>> ranges = new List<Tuple<int, int>>();
string fileName = @"C:\MyCoolFile.txt";
Exception exception = null;
//List populated here
Parallel.For(0, numberOfStreams, (index, state) =>
{
    try
    {
        HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create("Some URL");
        // Request only the byte range this stream is responsible for
        webRequest.AddRange(ranges[index].Item1, ranges[index].Item2);
        using (Stream responseStream = webRequest.GetResponse().GetResponseStream())
        using (FileStream fileStream = File.Open(fileName + "." + index + ".tmp", FileMode.OpenOrCreate, FileAccess.Write, FileShare.Write))
        {
            fileStream.Seek(ranges[index].Item1, SeekOrigin.Begin);
            byte[] buffer = new byte[64 * 1024];
            int bytesRead;
            while ((bytesRead = responseStream.Read(buffer, 0, buffer.Length)) > 0)
            {
                if (state.IsStopped)
                {
                    return;
                }
                fileStream.Write(buffer, 0, bytesRead);
            }
        }
    }
    catch (Exception e)
    {
        exception = e;
        state.Stop();
    }
});

My question is this: are there additional checks or actions that C#/Windows performs when writing to a single file from multiple threads that would make the file I/O slower than writing to multiple files? All disk operations should be bound by the disk speed, right? Can anyone explain this behavior?

Thanks in advance!

UPDATE: Here is the error the source server is throwing:

"Unable to write data to the transport connection: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond." [System.IO.IOException]: "Unable to write data to the transport connection: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond." InnerException: "A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond" Message: "Unable to write data to the transport connection: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond." StackTrace: " at System.Net.Sockets.NetworkStream.Write(Byte[] buffer, Int32 offset, Int32 size)\r\n at System.Net.Security._SslStream.StartWriting(Byte[] buffer, Int32 offset, Int32 count, AsyncProtocolRequest asyncRequest)\r\n at System.Net.Security._SslStream.ProcessWrite(Byte[] buffer, Int32 offset, Int32 count, AsyncProtocolRequest asyncRequest)\r\n at System.Net.Security.SslStream.Write(Byte[] buffer, Int32 offset, Int32 count)\r\n

asked Jul 31 '15 by shortspider


2 Answers

Unless you're writing to a striped RAID, you're unlikely to see any performance benefit from writing to the file from multiple threads concurrently. In fact, the opposite is more likely: the concurrent writes get interleaved and cause random access, incurring disk seek latencies that make them orders of magnitude slower than large sequential writes.

To get a sense of perspective, look at some latency comparisons. A sequential 1 MB read from disk takes around 20 ms, and writes take approximately the same time. Each disk seek, on the other hand, takes around 10 ms. If your writes are interleaved in 4 KB chunks, a 1 MB write is split into 256 chunks, each of which may incur a seek: 256 × 10 ms adds roughly 2560 ms of seek time, making the write over 100 times slower than a sequential one.

I would suggest allowing only one thread to write to the file at any time, and using parallelism just for the network transfer. You can use a producer-consumer pattern where downloaded chunks are written to a bounded concurrent collection (such as BlockingCollection<T>) and then picked up and written to disk by a dedicated writer thread, as sketched below.
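For illustration, here is a minimal sketch of that pattern (not the asker's original code: the chunk size, the range arithmetic, and the DownloadRange helper are placeholders standing in for the HTTP range requests shown in the question):

using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

class Chunk
{
    public long Offset;   // where this chunk belongs in the output file
    public byte[] Data;   // the downloaded bytes
}

class Downloader
{
    // Bounded so fast downloads cannot queue up unbounded amounts of memory
    static BlockingCollection<Chunk> chunks = new BlockingCollection<Chunk>(boundedCapacity: 16);

    static void Main()
    {
        string fileName = @"C:\MyCoolFile.txt";

        // Consumer: the only thread that ever touches the file
        Task writer = Task.Run(() =>
        {
            using (var fileStream = new FileStream(fileName, FileMode.Create, FileAccess.Write, FileShare.None))
            {
                foreach (Chunk chunk in chunks.GetConsumingEnumerable())
                {
                    fileStream.Seek(chunk.Offset, SeekOrigin.Begin);
                    fileStream.Write(chunk.Data, 0, chunk.Data.Length);
                }
            }
        });

        // Producers: download ranges in parallel and hand them to the writer
        Parallel.For(0, 4, index =>
        {
            long offset = index * 1024L * 1024;                  // placeholder range math
            byte[] data = DownloadRange(offset, 1024 * 1024);    // placeholder helper
            chunks.Add(new Chunk { Offset = offset, Data = data });
        });

        chunks.CompleteAdding();   // signal the writer that no more chunks are coming
        writer.Wait();
    }

    static byte[] DownloadRange(long offset, int length)
    {
        // Stands in for the HttpWebRequest range download from the question
        return new byte[length];
    }
}

The bounded capacity gives natural back-pressure: if the disk falls behind, chunks.Add blocks the downloading threads instead of letting completed chunks pile up in memory.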

answered by Douglas


    fileStream.Seek(ranges[index].Item1, SeekOrigin.Begin);

That Seek() call is a problem: it seeks to a part of the file that is very far beyond the current end-of-file. Your next fileStream.Write() call then forces the file system to extend the file on disk, filling the unwritten parts of it with zeros.

This can take a while, and your thread is blocked until the file system is done extending the file. That may well be long enough to trigger a timeout, and you would see it go wrong early, at the start of the transfer.

A workaround is to create and fill the entire file before you start writing real data. This is a very common strategy used by downloaders; you may have seen .part files before. A nice side benefit is a decent guarantee that the transfer cannot fail because the disk ran out of space. Beware that filling a file with zeros is only cheap when the machine has enough RAM; 1 GB should not be a problem on modern machines. A sketch of the pre-allocation follows.
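Here is a minimal sketch of that pre-allocation, assuming the total file size is known up front (for instance from the Content-Length of an initial request; the fileName and totalSize values below are placeholders):

string fileName = @"C:\MyCoolFile.txt";
long totalSize = 1024L * 1024 * 1024;   // assumed known in advance
byte[] zeros = new byte[64 * 1024];

// Write zeros sequentially so the file reaches its full length before
// any range writes happen; later seeks then never cross end-of-file.
using (var fileStream = new FileStream(fileName, FileMode.Create, FileAccess.Write, FileShare.None))
{
    long remaining = totalSize;
    while (remaining > 0)
    {
        int count = (int)Math.Min(zeros.Length, remaining);
        fileStream.Write(zeros, 0, count);
        remaining -= count;
    }
}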

Repro code for the original problem:

using System;
using System.IO;
using System.Diagnostics;

class Program {
    static void Main(string[] args) {
        string path = @"c:\temp\test.bin";
        var fs = new FileStream(path, FileMode.Create, FileAccess.Write, FileShare.Write);
        // Seek 1 GB past the start of the brand-new, empty file
        fs.Seek(1024L * 1024 * 1024, SeekOrigin.Begin);
        var buf = new byte[4096];
        var sw = Stopwatch.StartNew();
        // This single write forces the file system to zero-fill the 1 GB gap
        fs.Write(buf, 0, buf.Length);
        sw.Stop();
        Console.WriteLine("Writing 4096 bytes took {0} milliseconds", sw.ElapsedMilliseconds);
        Console.ReadKey();
        fs.Close();
        File.Delete(path);
    }
}

Output:

Writing 4096 bytes took 1491 milliseconds

That was on a fast SSD; a spindle drive is going to take much longer.

answered by Hans Passant