
Any Good Open Source Web Crawling Framework in C#

I am building a shopping comparison engine, and I need a crawling engine to perform the daily data collection.

I have decided to build the crawler in C#. I have had a lot of bad experiences with the HttpWebRequest/HttpWebResponse classes, and they are known to be highly buggy and unstable for large crawls, even in .NET Framework 4.0. So I have decided NOT to build on them.

I speak from my own personal experience.

I would like opinions from experts here who have written crawlers: do you know of any good open source crawling frameworks for C#, comparable to Java's Nutch and Apache Commons, which are very stable and highly robust libraries?

If such crawling frameworks already exist for C#, I will build my application on top of one of them.

If not, I am planning to extend this solution from CodeProject:

http://www.codeproject.com/KB/IP/Crawler.aspx

If anyone can suggest a better path, I would be really thankful.

EDIT: Some of the sites I have to crawl render their pages with very complex JavaScript. This adds complexity to my crawler, since I need to be able to crawl JavaScript-rendered pages. If anyone has used a C# library that can crawl JavaScript-rendered pages, please share. I have used WatiN, which I don't prefer, and I also know about Selenium. If you know of anything other than these, please share it with me and the community.

Sumit Ghosh asked Dec 05 '10 17:12





1 Answer

PhantomJS + HtmlAgilityPack

I know this topic is a bit old, but I've had the best results by far with PhantomJS. There is a NuGet package for it, and combining it with HtmlAgilityPack makes for a pretty decent fetching & scraping toolkit.

This example uses only PhantomJS's built-in parsing capabilities, driven through the Selenium WebDriver bindings. It worked with a very old version of the library; since the project appeared to be under active development when I wrote this, it is probably safe to assume that more capabilities have been added since.

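using System;
using System.Linq;
using OpenQA.Selenium;
using OpenQA.Selenium.PhantomJS;

void Test()
{
    var linkText = @"Help Spread DuckDuckGo!";
    // WebDriver needs an absolute URL, including the scheme.
    Console.WriteLine(GetHyperlinkUrl("https://duckduckgo.com", linkText));
    // at the time of writing, this would print ‘https://duckduckgo.com/spread’
}

/// <summary>
/// Loads pageUrl, finds a hyperlink containing searchLinkText, returns
/// its URL if found, otherwise an empty string.
/// </summary>
public string GetHyperlinkUrl(string pageUrl, string searchLinkText)
{
    using (IWebDriver phantom = new PhantomJSDriver())
    {
        // Navigate() is a method that returns the INavigation object.
        phantom.Navigate().GoToUrl(pageUrl);
        // FindElement throws NoSuchElementException when nothing matches,
        // so use FindElements, which returns an empty collection instead.
        var link = phantom.FindElements(By.PartialLinkText(searchLinkText)).FirstOrDefault();
        if (link != null)
            return link.GetAttribute("href");
    }
    return string.Empty;
}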
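
The example above does not use HtmlAgilityPack at all, so here is a minimal sketch of the scraping half of the combination. It assumes the HtmlAgilityPack NuGet package is referenced; LinkExtractor and ExtractLinks are hypothetical names made up for illustration, and the HTML string is a stand-in for whatever markup the browser returned.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class LinkExtractor
{
    // Hypothetical helper: returns (link text, href) pairs for every
    // anchor element with an href attribute in the given HTML.
    public static List<KeyValuePair<string, string>> ExtractLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var links = new List<KeyValuePair<string, string>>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
            return links;

        foreach (var anchor in anchors)
        {
            var href = anchor.GetAttributeValue("href", string.Empty);
            links.Add(new KeyValuePair<string, string>(anchor.InnerText.Trim(), href));
        }
        return links;
    }

    public static void Main()
    {
        // Stand-in markup; in a crawler this could be the rendered page source.
        var html = "<html><body><a href=\"/spread\">Help Spread DuckDuckGo!</a></body></html>";
        foreach (var link in ExtractLinks(html))
            Console.WriteLine(link.Key + " -> " + link.Value);
    }
}
```

In a crawler, the string passed to ExtractLinks could be the driver's PageSource property read after navigation, so the JavaScript has already run by the time HtmlAgilityPack sees the markup.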
Paul Smith answered Oct 12 '22 06:10
