
Download an Entire Website in C#

Forgive my ignorance on the subject

I am using

    string p = "http://" + textBox2.Text;
    string r = textBox3.Text;
    System.Net.WebClient webclient = new System.Net.WebClient();
    webclient.DownloadFile(p, r);

to download a webpage. Can you please help me enhance the code so that it downloads the entire website? I tried HTML screen scraping, but it returns only the href links of the index.html files. How do I proceed?

Thanks

Karthik asked Jan 19 '10 07:01



2 Answers

Scraping a website is actually a lot of work, with a lot of corner cases.

Invoke wget instead. The manual explains how to use the "recursive retrieval" options.
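As a rough sketch of what invoking wget from C# could look like, the class below starts it as an external process. It assumes wget is installed and on the PATH; the class name `WgetMirror`, its method names, and the output directory are made up for illustration. The flags are wget's standard recursive-retrieval options, but check the manual for the set that suits your site.

```csharp
using System;
using System.Diagnostics;

static class WgetMirror
{
    // Builds the wget argument string for a recursive download.
    public static string BuildArguments(string url, string outputDir)
    {
        return "--recursive --no-parent --page-requisites --convert-links "
             + "--directory-prefix=\"" + outputDir + "\" \"" + url + "\"";
    }

    // Runs wget, waits for it to finish, and returns its exit code (0 means success).
    public static int Mirror(string url, string outputDir)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = "wget",            // assumption: wget is on the PATH
            Arguments = BuildArguments(url, outputDir),
            UseShellExecute = false,
            CreateNoWindow = true
        };
        using (Process process = Process.Start(startInfo))
        {
            process.WaitForExit();
            return process.ExitCode;
        }
    }
}
```

From the question's form you could then call something like `WgetMirror.Mirror("http://" + textBox2.Text, textBox3.Text)`, treating the second text box as a target folder rather than a file name.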

Will answered Oct 06 '22 01:10


    // Requires: using System.IO; using System.Net;
    protected string GetWebString(string url)
    {
        // Request the page and read the whole response body into a string.
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }

This downloads the page at the supplied URL into a string.

You can then use a regular expression to pull the links you need out of that string.

This snippet gets specific links out of a Craigslist page and adds them to an ArrayList. Modify it for your purpose.

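    // Requires: using System.Collections; using System.Text.RegularExpressions;
    protected ArrayList GetListings(int pages)
    {
        // Note: the pages parameter is not used in this snippet.
        ArrayList list = new ArrayList();
        string page = GetWebString("http://albany.craigslist.org/bik/");

        // LINK captures the relative URL of each listing; TITLE captures its link text.
        MatchCollection listingMatches = Regex.Matches(page, "(<p><a href=\")(?<LINK>/.+/.+[.]html)(\">)(?<TITLE>.*)(-</a>)");
        foreach (Match m in listingMatches)
        {
            list.Add("http://albany.craigslist.org" + m.Groups["LINK"].Value);
        }
        return list;
    }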
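
To go from one page to a whole site, the same two steps (fetch a page, extract its links) can be repeated over a queue of URLs. The following is only a minimal sketch under several assumptions: the class `SiteCrawler`, the `maxPages` limit, and the `page0.html`, `page1.html` file naming are hypothetical, the href pattern is deliberately simple, and only HTML pages on the starting host are saved (no images, CSS, or scripts).

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

static class SiteCrawler
{
    // Simple pattern for href="..." or href='...'; stops before any #fragment.
    private static readonly Regex HrefPattern = new Regex(
        "href\\s*=\\s*[\"'](?<LINK>[^\"'#]+)", RegexOptions.IgnoreCase);

    // Returns the absolute http(s) links in html that stay on pageUri's host.
    public static List<Uri> GetLinks(Uri pageUri, string html)
    {
        var links = new List<Uri>();
        foreach (Match m in HrefPattern.Matches(html))
        {
            Uri absolute;
            // Resolve relative links such as "page2.html" against the current page.
            if (Uri.TryCreate(pageUri, m.Groups["LINK"].Value, out absolute)
                && (absolute.Scheme == Uri.UriSchemeHttp || absolute.Scheme == Uri.UriSchemeHttps)
                && absolute.Host == pageUri.Host
                && !links.Contains(absolute))
            {
                links.Add(absolute);
            }
        }
        return links;
    }

    // Breadth-first crawl that visits at most maxPages URLs and saves each page it can load.
    public static void Download(Uri start, string outputDir, int maxPages)
    {
        Directory.CreateDirectory(outputDir);
        var visited = new HashSet<Uri>();
        var queue = new Queue<Uri>();
        queue.Enqueue(start);

        using (var client = new WebClient())
        {
            while (queue.Count > 0 && visited.Count < maxPages)
            {
                Uri current = queue.Dequeue();
                if (!visited.Add(current)) continue;   // already handled

                string html;
                try { html = client.DownloadString(current); }
                catch (WebException) { continue; }      // skip pages that fail to load

                // Hypothetical naming scheme: page0.html, page1.html, ...
                string file = Path.Combine(outputDir, "page" + (visited.Count - 1) + ".html");
                File.WriteAllText(file, html);

                foreach (Uri link in GetLinks(current, html))
                {
                    if (!visited.Contains(link)) queue.Enqueue(link);
                }
            }
        }
    }
}
```

A call such as `SiteCrawler.Download(new Uri("http://example.com/"), "site", 50)` would save up to 50 pages into a `site` folder. Even this small version skips relative-link rewriting, robots.txt, and non-HTML content, which is the kind of corner case that makes a dedicated tool attractive.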

Jason answered Oct 05 '22 23:10