I am building a shopping comparison engine and I need to build a crawling engine to perform the daily data collection.
I have decided to build the crawler in C#. I have had a lot of bad experience with the HttpWebRequest/HttpWebResponse classes; they are known to be buggy and unstable for large crawls, so I have decided NOT to build on them. Even in .NET Framework 4.0 they are buggy.
I speak from my own personal experience.
I would like opinions from experts here who have been coding crawlers: do you know of any good open source crawling frameworks for C#? Java, for example, has Nutch and Apache Commons, which are very stable and highly robust libraries.
If there are already existing crawling frameworks in C#, I will go ahead and build my application on top of them.
If not, I am planning to extend this solution from CodeProject:
http://www.codeproject.com/KB/IP/Crawler.aspx
If anyone can suggest a better path, I would be really thankful.
EDIT: Some sites I have to crawl render their pages using very complex JavaScript, which adds complexity to my crawler since I need to be able to crawl JavaScript-rendered pages. If someone has used a C# library that can crawl JavaScript-rendered pages, please share it. I have used WatiN, which I don't prefer, and I also know about Selenium. If you know of anything other than these, please share it with me and the community.
PhantomJS + HtmlAgilityPack
I know this topic is a bit old, but I've had the best results by far with PhantomJS. There is a NuGet package for it, and combining it with HtmlAgilityPack makes for a pretty decent fetching & scraping toolkit.
This example just uses PhantomJS's built-in parsing capabilities. It worked with a very old version of the library; since the library still seems to be under active development, it is safe to assume that even more capabilities have been added since.
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.PhantomJS;

void Test()
{
    var linkText = @"Help Spread DuckDuckGo!";
    Console.WriteLine(GetHyperlinkUrl("https://duckduckgo.com", linkText));
    // as of right now, this would print 'https://duckduckgo.com/spread'
}

/// <summary>
/// Loads pageUrl, finds a hyperlink containing searchLinkText, returns
/// its URL if found, otherwise an empty string.
/// </summary>
public string GetHyperlinkUrl(string pageUrl, string searchLinkText)
{
    using (IWebDriver phantom = new PhantomJSDriver())
    {
        phantom.Navigate().GoToUrl(pageUrl);

        // FindElement throws if nothing matches, so use FindElements
        // and check whether any link was found.
        var links = phantom.FindElements(By.PartialLinkText(searchLinkText));
        if (links.Count > 0)
            return links[0].GetAttribute("href");
    }
    return string.Empty;
}
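To combine PhantomJS with HtmlAgilityPack as mentioned above, a minimal sketch might look like the following: PhantomJS renders the JavaScript-heavy page, and HtmlAgilityPack queries the rendered markup. This assumes the Selenium.WebDriver, PhantomJS, and HtmlAgilityPack NuGet packages (and an older Selenium release that still ships PhantomJSDriver); the class name, method name, and XPath query are only illustrative.

using System;
using HtmlAgilityPack;
using OpenQA.Selenium;
using OpenQA.Selenium.PhantomJS;

public static class RenderedPageScraper
{
    /// <summary>
    /// Renders pageUrl with PhantomJS, then parses the rendered DOM with
    /// HtmlAgilityPack and prints every hyperlink's text and target.
    /// </summary>
    public static void PrintLinks(string pageUrl)
    {
        using (IWebDriver phantom = new PhantomJSDriver())
        {
            phantom.Navigate().GoToUrl(pageUrl);

            // Feed the fully rendered markup (after JavaScript execution)
            // into HtmlAgilityPack for offline querying.
            var doc = new HtmlDocument();
            doc.LoadHtml(phantom.PageSource);

            var links = doc.DocumentNode.SelectNodes("//a[@href]");
            if (links == null)
                return;

            foreach (var link in links)
                Console.WriteLine("{0} -> {1}",
                    link.InnerText.Trim(),
                    link.GetAttributeValue("href", string.Empty));
        }
    }
}

The idea is to let PhantomJS do only what a headless browser is needed for (executing JavaScript and producing the final DOM) and to do all further extraction with HtmlAgilityPack, which is faster and easier to unit test than driving the browser for every query.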