Fastest way to scrape many web pages within one web site

I have a C# app that needs to scrape many pages within a single domain as fast as possible. I have a Parallel.ForEach that loops through all of the URLs (multi-threaded) and scrapes each one using the code below:

//requires: using System; using System.IO; using System.IO.Compression;
//          using System.Net; using System.Text;
private string ScrapeWebpage(string url, DateTime? updateDate)
{
    //create request (which supports http compression)
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
    request.Pipelined = true;
    request.KeepAlive = true;
    request.Headers.Add(HttpRequestHeader.AcceptEncoding, "gzip,deflate");
    if (updateDate != null)
        request.IfModifiedSince = updateDate.Value;

    //get response; the using blocks dispose everything, even on exceptions.
    using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
    {
        Stream responseStream = response.GetResponseStream();
        if (response.ContentEncoding.ToLower().Contains("gzip"))
            responseStream = new GZipStream(responseStream, CompressionMode.Decompress);
        else if (response.ContentEncoding.ToLower().Contains("deflate"))
            responseStream = new DeflateStream(responseStream, CompressionMode.Decompress);

        //read html.
        using (StreamReader reader = new StreamReader(responseStream, Encoding.Default))
        {
            return reader.ReadToEnd();
        }
    }
}
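
For context, the calling loop is roughly this (a simplified sketch; the urls collection and the processing step are placeholders for my real code):

//requires: using System.Threading.Tasks;
Parallel.ForEach(urls, url =>
{
    string html = ScrapeWebpage(url, null);
    //...parse and store the html here...
});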

As you can see, I have HTTP compression support and have set request.KeepAlive and request.Pipelined to true. I'm wondering whether this is the fastest way to scrape many pages within the same site, or whether there's a better approach that keeps the session open across requests. My code creates a new request instance for every page it hits; should I be reusing a single request instance for all of the pages? And is it ideal to have Pipelined and KeepAlive enabled?

asked Nov 13 '22 by Justin

1 Answer

It turns out what I was missing was this:

ServicePointManager.DefaultConnectionLimit = 1000000;
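
Some background on why this matters: by default, .NET caps concurrent HTTP connections to a single host at a very small number (two per host for client apps), so no matter how many threads Parallel.ForEach spins up, almost all of them sit queued waiting for one of those two connections. Raising ServicePointManager.DefaultConnectionLimit lifts that cap, and it has to happen before the first request to the host, because each ServicePoint captures the limit when it is created. A minimal sketch of where the line goes (reusing ScrapeWebpage and the placeholder urls collection from the question):

//raise the per-host connection cap BEFORE the first request;
//each ServicePoint captures this limit when it is created.
ServicePointManager.DefaultConnectionLimit = 1000000;

Parallel.ForEach(urls, url =>
{
    string html = ScrapeWebpage(url, null);
    //...process html...
});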
answered Dec 06 '22 by Justin