
A faster way to download multiple files

Tags: browser, c#, .net

I need to download about 2 million files from the SEC website. Each file has a unique URL and is about 10 kB on average. This is my current implementation:

    List<string> urls = new List<string>();
    // ... initialize urls ...
    WebBrowser browser = new WebBrowser();
    foreach (string url in urls)
    {
        browser.Navigate(url);
        while (browser.ReadyState != WebBrowserReadyState.Complete) Application.DoEvents();
        StreamReader sr = new StreamReader(browser.DocumentStream);
        StreamWriter sw = new StreamWriter(url.Substring(url.LastIndexOf('/')));
        sw.Write(sr.ReadToEnd());
        sr.Close();
        sw.Close();
    }

The projected time is about 12 days... is there a faster way?

Edit: by the way, the local file handling takes only 7% of the time

Edit: this is my final implementation:

    static void Main()
    {
        ServicePointManager.DefaultConnectionLimit = 10000;
        List<string> urls = new List<string>();
        // ... initialize urls ...
        int retries = urls.AsParallel().WithDegreeOfParallelism(8).Sum(arg => downloadFile(arg));
    }

    public static int downloadFile(string url)
    {
        int retries = 0;

        retry:
        try
        {
            HttpWebRequest webrequest = (HttpWebRequest)WebRequest.Create(url);
            webrequest.Timeout = 10000;
            webrequest.ReadWriteTimeout = 10000;
            webrequest.Proxy = null;
            webrequest.KeepAlive = false;
            using (HttpWebResponse webresponse = (HttpWebResponse)webrequest.GetResponse())
            using (Stream sr = webresponse.GetResponseStream())
            using (FileStream sw = File.Create(url.Substring(url.LastIndexOf('/'))))
            {
                sr.CopyTo(sw);
            }
        }

        catch (Exception ee)
        {
            if (ee.Message != "The remote server returned an error: (404) Not Found." && ee.Message != "The remote server returned an error: (403) Forbidden.")
            {
                if (ee.Message.StartsWith("The operation has timed out") ||
                    ee.Message == "Unable to connect to the remote server" ||
                    ee.Message.StartsWith("The request was aborted: ") ||
                    ee.Message.StartsWith("Unable to read data from the transport connection: ") ||
                    ee.Message == "The remote server returned an error: (408) Request Timeout.") retries++;
                else MessageBox.Show(ee.Message, "Error", MessageBoxButtons.OK, MessageBoxIcon.Error);
                goto retry;
            }
        }

        return retries;
    }
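
Side note: matching on exception Message strings is fragile, since the text can vary between .NET versions and locales. Below is a minimal sketch of the same retry classification done via WebException.Status and the HTTP status code instead; the helper name IsRetryable and the exact set of retryable statuses are assumptions for illustration, not taken from the code above.

    // Sketch only: classify a download failure as retryable using
    // WebException.Status / the HTTP status code rather than message text.
    // Requires: using System.Net;
    public static bool IsRetryable(Exception ex)
    {
        var we = ex as WebException;
        if (we == null) return false;

        // Timeouts, connection failures and aborted requests are worth retrying.
        if (we.Status == WebExceptionStatus.Timeout ||
            we.Status == WebExceptionStatus.ConnectFailure ||
            we.Status == WebExceptionStatus.ReceiveFailure ||
            we.Status == WebExceptionStatus.RequestCanceled)
            return true;

        // 408 Request Timeout arrives as a protocol error with a response attached;
        // 404 and 403 are permanent and should not be retried.
        var response = we.Response as HttpWebResponse;
        return response != null && response.StatusCode == HttpStatusCode.RequestTimeout;
    }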
asked Jan 15 '12 by o17t H1H' S'k


3 Answers

Execute the downloads concurrently instead of sequentially, and set a sensible MaxDegreeOfParallelism, otherwise you will try to make too many simultaneous requests, which will look like a DoS attack:

    public static void Main(string[] args)
    {
        var urls = new List<string>();
        Parallel.ForEach(
            urls, 
            new ParallelOptions{MaxDegreeOfParallelism = 10},
            DownloadFile);
    }

    public static void DownloadFile(string url)
    {
        using (var sr = new StreamReader(HttpWebRequest.Create(url)
            .GetResponse().GetResponseStream()))
        using(var sw = new StreamWriter(url.Substring(url.LastIndexOf('/'))))
        {
            sw.Write(sr.ReadToEnd());
        }
    }
answered Nov 11 '22 by Myles McDonnell


Download the files in several threads. The number of threads depends on your throughput. Also, look at the WebClient and HttpWebRequest classes. A simple sample:

    var list = new[]
    {
        "http://google.com",
        "http://yahoo.com",
        "http://stackoverflow.com"
    };

    Parallel.ForEach(list,
        s =>
        {
            using (var client = new WebClient())
            {
                Console.WriteLine($"starting to download {s}");
                string result = client.DownloadString(s);
                Console.WriteLine($"finished downloading {s}");
            }
        });
answered Nov 11 '22 by Kirill Polishchuk


I'd use several threads in parallel, with a WebClient. I recommend setting the max degree of parallelism to the number of threads you want, since an unspecified degree of parallelism doesn't work well for long-running tasks. I've used 50 parallel downloads in one of my projects without a problem, but depending on the speed of an individual download, a much lower number might be sufficient.

If you download multiple files in parallel from the same server, you're by default limited to a small number (2 or 4) of parallel downloads. While the HTTP standard specifies such a low limit, many servers don't enforce it. Use ServicePointManager.DefaultConnectionLimit = 10000; to increase the limit.
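
Putting these two points together, a minimal sketch of the approach might look like the following; the degree of parallelism of 50 and the way the local file name is derived from the URL are illustrative assumptions, not part of the answer:

    // Sketch only: WebClient + Parallel.ForEach with a capped degree of parallelism.
    // Requires: using System.Collections.Generic; using System.Net; using System.Threading.Tasks;
    ServicePointManager.DefaultConnectionLimit = 10000;   // lift the per-host connection cap

    var urls = new List<string>();
    // ... initialize urls ...

    Parallel.ForEach(
        urls,
        new ParallelOptions { MaxDegreeOfParallelism = 50 },   // 50 is an assumed value
        url =>
        {
            using (var client = new WebClient())
            {
                // Derive the local file name from the last URL segment (assumption).
                string fileName = url.Substring(url.LastIndexOf('/') + 1);
                client.DownloadFile(url, fileName);
            }
        });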

answered Nov 11 '22 by CodesInChaos