 

Fastest way to save thousands of files in VB.NET? [closed]

I'm downloading thousands of files every second. Each file is about 5KB and the total download speed is about 200Mb/s. I need to save all of these files.

The download process is split between thousands of different async tasks that are running. When they finish downloading a file and want to save it, they add it to a queue of files to save.

Here is what the class for this looks like. I create an instance of this class at the very beginning, and have my tasks add files that need to be saved to the queue.

Imports System.IO
Imports System.Collections.Concurrent
Imports System.Threading.Tasks

Public Class FileSaver

    Structure FileToSave
        Dim path As String
        Dim data() As Byte
    End Structure

    ' Thread-safe queue; Take() blocks until an item is available.
    Private FileQueue As New BlockingCollection(Of FileToSave)

    Sub New()
        ' Single background consumer that drains the queue and writes each file.
        Task.Run(
            Async Function()

                While True
                    Dim fl As FileToSave = FileQueue.Take()
                    Using sourceStream As New FileStream(fl.path, FileMode.Append, FileAccess.Write, FileShare.None, bufferSize:=4096, useAsync:=True)
                        Await sourceStream.WriteAsync(fl.data, 0, fl.data.Length)
                    End Using
                End While

            End Function
        )
    End Sub

    ' Producers (the download tasks) call this; it never blocks on disk I/O.
    Public Sub Add(path As String, data() As Byte)
        Dim fl As FileToSave
        fl.path = path
        fl.data = data
        FileQueue.Add(fl)
    End Sub

    Public Function Count() As Integer
        Return FileQueue.Count
    End Function

End Class

There is only one instance of this class, and therefore only one queue; each task does not create a separate queue. All of my tasks add files to this single global instance's internal queue.

I've since replaced the ConcurrentQueue with a default BlockingCollection, which wraps a ConcurrentQueue internally but gives me a blocking Take() from the collection, so I don't have to poll in a loop.
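For illustration, a minimal sketch of how the single shared instance might be wired up (the names Saver, Client, and DownloadOne are hypothetical, not from my actual code):

Imports System.Net.Http
Imports System.Threading.Tasks

Module Program
    ' One shared saver for the whole process; every download task uses it.
    Public ReadOnly Saver As New FileSaver()
    Private ReadOnly Client As New HttpClient()

    ' Hypothetical download task: fetch the bytes, then hand them to the queue.
    Async Function DownloadOne(url As String, destPath As String) As Task
        Dim data() As Byte = Await Client.GetByteArrayAsync(url)
        Saver.Add(destPath, data) ' returns immediately; the saver writes in the background
    End Function
End Module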

The hard disk I'm using supports ~180MB/s maximum read/write speeds. I'm only downloading at 200Mb/s, which is just 25MB/s, yet I can't seem to save the data fast enough: the queue keeps growing. Something is wrong, and I can't figure out what.

Is this the best (fastest) way to do it? Could I make any improvements here?


EDIT: This question was put on hold, and I can't post my own answer with what I figured out, so I'll post it here.

The problem here is that while writing to a file is a relatively cheap operation, opening a file for writing is not. Since I was downloading thousands of files, I was saving each one separately, which was significantly hurting performance.

What I did instead was group multiple downloaded files (while they were still in RAM) into one file (with delimiters) and write that file to disk. The files I'm downloading have some properties that allow them to be logically grouped this way and still used later. The ratio is about 100:1.
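For illustration, a rough sketch of the batching idea, assuming a simple length-prefixed record format as the delimiter scheme (the real scheme depends on the files' properties, and the WriteBatch name is made up):

Imports System.Collections.Generic
Imports System.IO
Imports System.Text

Module BatchWriter

    ' Writes many in-memory files into one container file with a single open.
    ' Reuses the FileToSave structure from the question.
    Sub WriteBatch(containerPath As String, files As IEnumerable(Of FileSaver.FileToSave))
        Using writer As New BinaryWriter(File.Open(containerPath, FileMode.Create))
            For Each fl In files
                Dim pathBytes() As Byte = Encoding.UTF8.GetBytes(fl.path)
                writer.Write(pathBytes.Length)   ' record header: path length
                writer.Write(pathBytes)          ' the original file's path
                writer.Write(fl.data.Length)     ' record header: data length
                writer.Write(fl.data)            ' the file contents
            Next
        End Using
    End Sub

End Module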

I no longer seem to be write-bound, and I'm currently saving at ~40MB/s. If I hit another premature limit, I'll update this. Hope this helps someone.


EDIT2: More progress toward faster I/O.

Since I'm now combining multiple files into one, I perform a single open (CreateFile) operation and then multiple writes to the opened file. This is good, but still not optimal: one 10MB write is better than ten 1MB writes. Multiple small writes are slower and cause disk fragmentation, which later slows down reads as well. Not good.

So the solution was to buffer all (or as many as I can) downloaded files in RAM, and then, once I've hit some threshold, write them all to the single file with one Write operation. I have ~50GB of RAM, so this works great for me.
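A minimal sketch of this accumulate-then-flush idea, assuming a made-up threshold name (FlushThresholdBytes) and an illustrative value that should be tuned to available RAM:

Imports System.IO

Public Class BufferedBatchSaver
    ' Illustrative threshold; with ~50GB of RAM this could be far larger.
    Private Const FlushThresholdBytes As Long = 256L * 1024 * 1024

    Private ReadOnly buffer As New MemoryStream()
    Private ReadOnly targetPath As String

    Public Sub New(path As String)
        targetPath = path
    End Sub

    ' Accumulate downloaded files in RAM.
    Public Sub Append(data() As Byte)
        buffer.Write(data, 0, data.Length)
        If buffer.Length >= FlushThresholdBytes Then Flush()
    End Sub

    ' One large sequential write instead of many small ones.
    Public Sub Flush()
        Using fs As New FileStream(targetPath, FileMode.Append, FileAccess.Write)
            buffer.WriteTo(fs)
        End Using
        buffer.SetLength(0)
    End Sub
End Class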

However, now there is another problem. Since I'm manually buffering my write data to do as few Write operations as possible, the Windows cache becomes somewhat redundant and actually starts slowing things down and eating up RAM. Let's get rid of it.

The solution is unbuffered (and asynchronous) I/O, which Windows' CreateFile() supports but .NET does not easily expose. I had to use a library (the only one that seems to exist) to accomplish this, which you can find here: http://programmingaddicted.blogspot.com/2011/05/unbuffered-overlapped-io-in-net.html

That allows for simple unbuffered asynchronous I/O from .NET. The only requirement is that you now have to manually sector-align your Byte() buffers, otherwise WriteFile() fails with an "Invalid Parameter" error. In my case this just meant padding my buffers to a multiple of 512 bytes.
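For example, padding a buffer up to the next 512-byte boundary might look like this (a sketch; the padding bytes would need to be accounted for when reading back, e.g. via a stored length):

' Pads a buffer to a multiple of the sector size, as unbuffered writes require.
Function AlignToSector(data() As Byte, Optional sectorSize As Integer = 512) As Byte()
    Dim remainder As Integer = data.Length Mod sectorSize
    If remainder = 0 Then Return data
    ' VB arrays are declared by upper bound, hence the -1.
    Dim aligned(data.Length + (sectorSize - remainder) - 1) As Byte
    System.Buffer.BlockCopy(data, 0, aligned, 0, data.Length)
    Return aligned
End Function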

After all of this, I was able to hit ~110MB/s write speed to my drive. Much better than I expected.

Asked Jul 13 '15 by David Davidson


1 Answer

I would suggest that you look into TPL Dataflow. It looks like you want a producer/consumer pattern.

The advantage of TPL Dataflow over your current implementation is that you can specify the degree of parallelism, which lets you experiment with the numbers and tune the solution to your needs.
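A minimal sketch of that idea, assuming the System.Threading.Tasks.Dataflow NuGet package and reusing the FileToSave structure from the question (the parallelism value of 4 is just a starting point to tune, not a recommendation):

Imports System.IO
Imports System.Threading.Tasks.Dataflow

Module DataflowSaver

    ' An ActionBlock is the consumer: it queues posted items internally and
    ' writes them with a bounded degree of parallelism.
    Private ReadOnly SaveBlock As New ActionBlock(Of FileSaver.FileToSave)(
        Async Function(fl)
            Using fs As New FileStream(fl.path, FileMode.Create, FileAccess.Write,
                                       FileShare.None, bufferSize:=4096, useAsync:=True)
                Await fs.WriteAsync(fl.data, 0, fl.data.Length)
            End Using
        End Function,
        New ExecutionDataflowBlockOptions With {.MaxDegreeOfParallelism = 4})

    ' Producers call this; Post returns immediately.
    Public Sub Save(path As String, data() As Byte)
        SaveBlock.Post(New FileSaver.FileToSave With {.path = path, .data = data})
    End Sub

End Module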

As @Graffito mentions, if you are using spinning platters, writing may be limited by the number of files being written concurrently, which makes tuning for best performance a matter of trial and error.

You, of course, could write your own mechanism to limit concurrency.

I hope that this helps.

[Additional] I worked at a company that archived email and had similar disk-writing requirements. That company had issues with I/O speeds when there were too many files in a directory, so they chose to limit each directory to 1,000 files. That decision was before my time, but it may be relevant to your project.
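If that constraint applies to you, one simple scheme is to bucket files into subdirectories by a sequential ID (a sketch; the naming convention here is hypothetical):

Imports System.IO

Module DirectoryBuckets
    ' Caps each directory at 1,000 files via integer division on a file ID.
    Function BucketedPath(rootDir As String, fileId As Integer) As String
        Dim bucket As Integer = fileId \ 1000
        Return Path.Combine(rootDir, bucket.ToString(), fileId.ToString() & ".dat")
    End Function
End Module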

Answered Nov 14 '22 by Phillip Scott Givens