
Downloading multiple files quickly and efficiently (async)

I have many files to download, so I am trying to use the power of the new async features, as below.

var streamTasks = urls.Select(async url => (await WebRequest.CreateHttp(url).GetResponseAsync()).GetResponseStream()).ToList();

var streams = await Task.WhenAll(streamTasks);
foreach (var stream in streams)
{
    using (var fileStream = new FileStream("blabla", FileMode.Create))
    {
        await stream.CopyToAsync(fileStream);
    }
}

What I am afraid of is that this code will cause high memory usage: if there are 1000 files of 2 MB each, will this code load 1000 × 2 MB of streams into memory?

Am I missing something, or am I right? If I am not missing anything, is awaiting each request and consuming its stream one at a time the better approach?

asked Dec 26 '22 by Freshblood


2 Answers

Both options could be problematic. Downloading only one file at a time doesn't scale and takes a long time, while downloading all files at once could be too much of a load (and there is no need to wait for all of them to finish before you start processing them).

I prefer to always cap such operations with a configurable limit. A simple way to do so is to use an AsyncLock (which utilizes SemaphoreSlim); a more robust way is to use TPL Dataflow with a MaxDegreeOfParallelism.
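
For the first option, here is a minimal sketch that caps concurrency with SemaphoreSlim directly (the limit of 100, the urls collection, and the "blabla" file name are placeholders for illustration, not part of the original answer):

// Allow at most 100 downloads in flight at once (arbitrary limit for illustration).
var semaphore = new SemaphoreSlim(100);

var downloadTasks = urls.Select(async url =>
{
    await semaphore.WaitAsync();
    try
    {
        var response = await WebRequest.CreateHttp(url).GetResponseAsync();
        using (var stream = response.GetResponseStream())
        using (var fileStream = new FileStream("blabla", FileMode.Create))
        {
            await stream.CopyToAsync(fileStream);
        }
    }
    finally
    {
        semaphore.Release();
    }
});

await Task.WhenAll(downloadTasks);

The TPL Dataflow version: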

// At most 100 downloads run concurrently; additional URLs are buffered by the block.
var block = new ActionBlock<string>(async url =>
    {
        var stream = (await WebRequest.CreateHttp(url).GetResponseAsync()).GetResponseStream();
        using (var fileStream = new FileStream("blabla", FileMode.Create))
        {
            await stream.CopyToAsync(fileStream);
        }
    },
    new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 100 });
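
A brief usage sketch (assuming a urls collection as in the question): post each URL into the block, signal completion, and await it.

foreach (var url in urls)
{
    block.Post(url);
}

block.Complete();        // no more URLs will be posted
await block.Completion;  // completes when all queued downloads have finished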
answered Apr 27 '23 by i3arnon


Your code will load the streams into memory whether you use async or not. Asynchrony only handles the I/O part, returning control to the caller until the response stream is available.

The choice you have to make doesn't concern async, but rather how your program should read a large stream of input.

If I were you, I would think about how to split the workload into chunks. You might read each ResponseStream in parallel, save each stream to a different destination (for example, a file), and release it from memory, as in the sketch below.
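
As a rough sketch of that idea (the url and path variables here are placeholders, not from the answer), each response can be streamed straight to disk in small chunks so that only one buffer's worth of data sits in memory per download:

// Copy the response to a file in 80 KB chunks instead of buffering the whole body.
using (var response = await WebRequest.CreateHttp(url).GetResponseAsync())
using (var stream = response.GetResponseStream())
using (var fileStream = new FileStream(path, FileMode.Create, FileAccess.Write, FileShare.None, 81920, useAsync: true))
{
    await stream.CopyToAsync(fileStream, 81920);
}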

answered Apr 27 '23 by Yuval Itzchakov