Simple web crawler in C#

Tags: c#, web-crawler

I have created a simple web crawler, but I want to add recursion so that for every page that is opened I can also get the URLs on that page. I have no idea how to do that, and I also want to include threads to make it faster. Here is my code:

using System;
using System.IO;
using System.Net;
using System.Windows.Forms;
using mshtml; // requires a COM reference to "Microsoft HTML Object Library"

namespace Crawler
{
    public partial class Form1 : Form
    {
        String Rstring;

        public Form1()
        {
            InitializeComponent();
        }

        private void button1_Click(object sender, EventArgs e)
        {
            WebRequest myWebRequest;
            WebResponse myWebResponse;
            String URL = textBox1.Text;

            myWebRequest = WebRequest.Create(URL);
            myWebResponse = myWebRequest.GetResponse(); // returns a response from the Internet resource

            Stream streamResponse = myWebResponse.GetResponseStream(); // the data stream of the response

            StreamReader sreader = new StreamReader(streamResponse); // reads the data stream
            Rstring = sreader.ReadToEnd(); // reads it to the end
            String Links = GetContent(Rstring); // extracts only the links

            textBox2.Text = Rstring;
            textBox3.Text = Links;
            streamResponse.Close();
            sreader.Close();
            myWebResponse.Close();
        }

        private String GetContent(String Rstring)
        {
            String sString = "";
            HTMLDocument d = new HTMLDocument();
            IHTMLDocument2 doc = (IHTMLDocument2)d;
            doc.write(Rstring);

            IHTMLElementCollection L = doc.links;

            foreach (IHTMLElement links in L)
            {
                sString += links.getAttribute("href", 0);
                sString += "\n"; // was "/n", which is not a newline; use "\r\n" if displaying in a multiline TextBox
            }
            return sString;
        }
    }
}
asked May 04 '12 by Khaled Mohamed

2 Answers

I fixed your GetContent method as follows to get new links from a crawled page:

using System.Collections.Generic;
using System.Text.RegularExpressions;

public ISet<string> GetNewLinks(string content)
{
    // Matches the URL between the quotes of href='...' or href="..." in anchor tags
    Regex regexLink = new Regex("(?<=<a\\s*?href=(?:'|\"))[^'\"]*?(?=(?:'|\"))");

    ISet<string> newLinks = new HashSet<string>();
    foreach (var match in regexLink.Matches(content))
    {
        // HashSet ignores duplicates anyway; the check just avoids a redundant Add
        if (!newLinks.Contains(match.ToString()))
            newLinks.Add(match.ToString());
    }

    return newLinks;
}

Updated

Fixed: regex should be regexLink. Thanks @shashlearner for pointing this out (my typo).
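To cover the recursion and threading part of the question, here is a minimal sketch of how GetNewLinks could drive a recursive, parallel crawl. The Crawl method, the depth limit, and the ConcurrentDictionary visited set are illustrative additions, not part of the original code:

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

public class SimpleCrawler
{
    // Thread-safe record of visited pages, so the recursion terminates
    // (a plain HashSet is not safe to share across threads)
    private readonly ConcurrentDictionary<string, bool> visited =
        new ConcurrentDictionary<string, bool>();

    public void Crawl(string url, int depth)
    {
        // Stop at the depth limit, or if another thread already claimed this URL
        if (depth <= 0 || !visited.TryAdd(url, true))
            return;

        string content;
        try
        {
            using (var client = new WebClient())
                content = client.DownloadString(url);
        }
        catch (WebException)
        {
            return; // skip pages that fail to download
        }

        Console.WriteLine(url);

        // Crawl the discovered links in parallel rather than one by one
        Parallel.ForEach(GetNewLinks(content), link =>
        {
            if (link.StartsWith("http")) // this sketch ignores relative links
                Crawl(link, depth - 1);
        });
    }

    public ISet<string> GetNewLinks(string content)
    {
        Regex regexLink = new Regex("(?<=<a\\s*?href=(?:'|\"))[^'\"]*?(?=(?:'|\"))");
        ISet<string> newLinks = new HashSet<string>();
        foreach (var match in regexLink.Matches(content))
            newLinks.Add(match.ToString());
        return newLinks;
    }
}

You would start it with something like new SimpleCrawler().Crawl("http://example.com/", 2); the depth parameter bounds the recursion so the crawl cannot run forever.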

answered Sep 22 '22 by Darius Kucinskas

I have created something similar using Reactive Extensions:

https://github.com/Misterhex/WebCrawler

I hope it can help you.

Crawler crawler = new Crawler();

var observable = crawler.Crawl(new Uri("http://www.codinghorror.com/"));

observable.Subscribe(onNext: Console.WriteLine,
    onCompleted: () => Console.WriteLine("Crawling completed"));
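As a self-contained illustration of the same push-based idea, here is a minimal sketch using the System.Reactive package. The code below is only illustrative and does not reflect the actual API of the linked WebCrawler repository:

using System;
using System.Net;
using System.Reactive.Linq;

class RxCrawlDemo
{
    static void Main()
    {
        // Turn a list of seed URLs into a stream of results,
        // downloading each page on a background thread
        var pages = new[] { "http://example.com/", "http://example.org/" }
            .ToObservable()
            .SelectMany(url => Observable.Start(() =>
            {
                using (var client = new WebClient())
                    return url + " -> " + client.DownloadString(url).Length + " chars";
            }));

        pages.Subscribe(
            onNext: Console.WriteLine,
            onCompleted: () => Console.WriteLine("Crawling completed"));

        Console.ReadLine(); // keep the console open while the downloads finish
    }
}

The observer never blocks waiting for a page; results are pushed to Subscribe as each download completes, which is what makes the reactive style a natural fit for crawling.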
answered Sep 22 '22 by Misterhex