 

How to handle large numbers of concurrent disk write requests as efficiently as possible

Tags:

c#

.net

disk

Say the method below is being called several thousand times by different threads in a .NET 4 application. What's the best way to handle this situation? I understand that the disk is the bottleneck here, but I'd like the WriteFile() method to return quickly.

Data can be up to a few MB. Are we talking thread pool, TPL or the like?

public void WriteFile(string FileName, MemoryStream Data)
{
   try
   {
      using (FileStream DiskFile = File.OpenWrite(FileName))
      {
         Data.WriteTo(DiskFile);
         DiskFile.Flush();
         DiskFile.Close();
      }
   }
   catch (Exception e)
   {
      Console.WriteLine(e.Message);
   }
}
asked Sep 14 '11 by Canacourse


3 Answers

If you want to return quickly and don't really care that the operation is actually synchronous, you could create some kind of in-memory queue into which you put write requests. While the queue isn't full, WriteFile() can return quickly; another thread is responsible for draining the queue and writing the files. If WriteFile() is called while the queue is full, you have to wait until there is room, and execution becomes synchronous again. But with a big enough buffer, if the write requests don't arrive at a steady rate but come in spikes (with pauses between the spikes), this change can be a real improvement in performance.

UPDATE: I made a little picture for you. Notice that the bottleneck always exists; all you can possibly do is optimize requests by using a queue. The queue has limits too, so when it fills up you cannot instantly queue more files; you have to wait for free space in that buffer as well. But for the situation in the picture (three bucket requests) it's obvious you can quickly put the buckets into the queue and return, while in the first case you have to handle them one by one and block execution.

Notice that you never need to run many I/O threads at once: they would all share the same bottleneck, and you would just waste memory by parallelizing heavily. I believe 2 to 10 threads at most will easily consume all available I/O bandwidth, and keeping the count low limits application memory usage too.

(diagram: three write requests handled one by one vs. buffered through a queue)
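A minimal sketch of the bounded-queue idea above, using .NET 4's BlockingCollection. The class name, the capacity of 64, and the single consumer thread are my assumptions, not part of the answer; tune the capacity to your memory budget.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

public class QueuedFileWriter : IDisposable
{
    // Bounded queue: Add() returns quickly while there is room and
    // blocks when full, giving exactly the back-pressure described above.
    // The capacity of 64 is an assumption; tune it to your memory budget.
    private readonly BlockingCollection<Tuple<string, MemoryStream>> _queue =
        new BlockingCollection<Tuple<string, MemoryStream>>(boundedCapacity: 64);
    private readonly Task _consumer;

    public QueuedFileWriter()
    {
        // A single consumer drains the queue; the disk is the bottleneck,
        // so more writer threads rarely help.
        _consumer = Task.Factory.StartNew(() =>
        {
            foreach (var item in _queue.GetConsumingEnumerable())
            {
                using (FileStream disk = File.OpenWrite(item.Item1))
                {
                    item.Item2.WriteTo(disk);
                }
            }
        }, TaskCreationOptions.LongRunning);
    }

    // Returns quickly while the queue has room; blocks when it is full.
    public void WriteFile(string fileName, MemoryStream data)
    {
        _queue.Add(Tuple.Create(fileName, data));
    }

    public void Dispose()
    {
        // Stop accepting work and let the consumer finish the backlog.
        _queue.CompleteAdding();
        _consumer.Wait();
    }
}
```

Disposing the writer flushes the backlog, so pending writes are not lost on shutdown.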

answered Nov 16 '22 by Valentin Kuzub


Since you say that the files don't need to be written in order or immediately, the simplest approach would be to use a Task:

private void WriteFileAsynchronously(string FileName, MemoryStream Data)
{
    Task.Factory.StartNew(() => WriteFileSynchronously(FileName, Data));
}

private void WriteFileSynchronously(string FileName, MemoryStream Data)
{
    try
    {
        using (FileStream DiskFile = File.OpenWrite(FileName))
        {
            Data.WriteTo(DiskFile);
            DiskFile.Flush();
            DiskFile.Close();
        }
    }
    catch (Exception e)
    {
        Console.WriteLine(e.Message);
    }
}

The TPL uses the thread pool internally, and should be fairly efficient even for large numbers of tasks.

answered Nov 16 '22 by Cameron


If data is coming in faster than you can log it, you have a real problem. A producer/consumer design that has WriteFile just throwing stuff into a ConcurrentQueue or similar structure, with a separate thread servicing that queue, works great ... until the queue fills up. And if you're talking about opening 50,000 different files, things are going to back up quickly. Not to mention that data that can run to several megabytes per file will further limit the size of your queue.

I've had a similar problem that I solved by having the WriteFile method append to a single file. The records it wrote had a record number, file name, length, and then the data. As Hans pointed out in a comment to your original question, writing to a file is quick; opening a file is slow.

A second thread in my program starts reading that file that WriteFile is writing to. That thread reads each record header (number, filename, length), opens a new file, and then copies data from the log file to the final file.

This works better if the log file and the final file are on different disks, but it can still work well with a single spindle. It sure exercises your hard drive, though.

It has the drawback of requiring twice the disk space, but with 2-terabyte drives under $150 I don't consider that much of a problem. It's also less efficient overall than writing the data directly (because you have to handle the data twice), but it has the benefit of not stalling the main processing thread.
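A rough sketch of the single-log-file scheme described above. The record layout (record number, file name, length, then data) follows the answer, but the framing details are my assumptions: BinaryWriter's length-prefixed strings stand in for whatever encoding the original program used, and the names here are made up.

```csharp
using System.IO;
using System.Text;

public static class LogWriter
{
    private static int _recordNumber;
    private static readonly object _gate = new object();

    // Appending to one already-open stream is cheap; the slow part
    // (opening thousands of files) moves to the background reader.
    public static void Append(Stream log, string fileName, MemoryStream data)
    {
        lock (_gate)
        {
            var writer = new BinaryWriter(log, Encoding.UTF8);
            writer.Write(++_recordNumber);   // record number
            writer.Write(fileName);          // length-prefixed string
            writer.Write((int)data.Length);  // payload length
            data.WriteTo(log);
            log.Flush();
        }
    }

    // The second thread replays the log: read each record header,
    // open the final file, and copy the payload across.
    public static void Replay(Stream log)
    {
        var reader = new BinaryReader(log, Encoding.UTF8);
        while (log.Position < log.Length)
        {
            reader.ReadInt32();                   // skip record number
            string fileName = reader.ReadString();
            int length = reader.ReadInt32();
            byte[] payload = reader.ReadBytes(length);
            using (FileStream final = File.OpenWrite(fileName))
            {
                final.Write(payload, 0, length);
            }
        }
    }
}
```

In the real design Replay runs concurrently on a second thread, tailing the log as it grows; the sequential version here just shows the record format round-tripping.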

answered Nov 16 '22 by Jim Mischel