Fastest way to scrape many web pages within one web site

I have a C# app that needs to scrape many pages within a single domain as fast as possible. I have a Parallel.ForEach that loops through all of the URLs (multi-threaded) and scrapes each one using the code below:

//requires: using System; using System.IO; using System.IO.Compression;
//          using System.Net; using System.Text;
private string ScrapeWebpage(string url, DateTime? updateDate)
{
    //create request (which supports http compression)
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
    request.Pipelined = true;
    request.KeepAlive = true;
    request.Headers.Add(HttpRequestHeader.AcceptEncoding, "gzip,deflate");
    if (updateDate != null)
        request.IfModifiedSince = updateDate.Value;

    //get response; the using blocks dispose everything, even on exceptions.
    using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
    {
        Stream responseStream = response.GetResponseStream();
        if (response.ContentEncoding.ToLower().Contains("gzip"))
            responseStream = new GZipStream(responseStream, CompressionMode.Decompress);
        else if (response.ContentEncoding.ToLower().Contains("deflate"))
            responseStream = new DeflateStream(responseStream, CompressionMode.Decompress);

        //read html.
        using (StreamReader reader = new StreamReader(responseStream, Encoding.Default))
        {
            return reader.ReadToEnd();
        }
    }
}
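
For context, the calling loop is roughly this (a simplified sketch; the urls collection and the processing step are placeholders for my real code):

//requires: using System.Threading.Tasks;
Parallel.ForEach(urls, url =>
{
    string html = ScrapeWebpage(url, null);
    //...parse and store the html here...
});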

As you can see, I have HTTP compression support and have set request.KeepAlive and request.Pipelined to true. I'm wondering whether this is the fastest way to scrape many pages within the same site, or whether there's a better approach that keeps the session open across requests. My code creates a new request instance for every page it hits; should I be reusing a single request instance for all of the pages? And is it ideal to have Pipelined and KeepAlive enabled?

asked Nov 13 '22 by Justin

1 Answer

It turns out what I was missing was this:

ServicePointManager.DefaultConnectionLimit = 1000000;
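
Some background on why this matters: by default, .NET caps concurrent HTTP connections to a single host at a very small number (two per host for client apps), so no matter how many threads Parallel.ForEach spins up, almost all of them sit queued waiting for one of those two connections. Raising ServicePointManager.DefaultConnectionLimit lifts that cap, and it has to happen before the first request to the host, because each ServicePoint captures the limit when it is created. A minimal sketch of where the line goes (reusing ScrapeWebpage and the placeholder urls collection from the question):

//raise the per-host connection cap BEFORE the first request;
//each ServicePoint captures this limit when it is created.
ServicePointManager.DefaultConnectionLimit = 1000000;

Parallel.ForEach(urls, url =>
{
    string html = ScrapeWebpage(url, null);
    //...process html...
});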
answered Dec 06 '22 by Justin