I am building a shopping comparison engine and I need to build a crawling engine to perform the daily data collection.
I have decided to build the crawler in C#. I have had a lot of bad experience with the HttpWebRequest/HttpWebResponse classes; they are known to be buggy and unstable for large crawls, so I have decided NOT to build on them. Even in .NET Framework 4.0 they are buggy.
I speak from my own personal experience.
I would like opinions from experts here who have been coding crawlers: do you know of any good open source crawling frameworks for C#? Java, for example, has Nutch and Apache Commons, which are very stable and highly robust libraries.
If there are already existing crawling frameworks in C#, I will go ahead and build my application on top of them.
If not, I am planning to extend this solution from CodeProject:
http://www.codeproject.com/KB/IP/Crawler.aspx
If anyone can suggest a better path, I would be really thankful.
EDIT: Some sites I have to crawl render their pages using very complex JavaScript, which adds complexity to my crawler since I need to be able to crawl JavaScript-rendered pages. If someone has used a C# library that can crawl JavaScript-rendered pages, please share it. I have used WatiN, which I don't prefer, and I also know about Selenium. If you know of anything other than these, please share it with me and the community.
PhantomJS + HtmlAgilityPack
I know this topic is a bit old, but I've had the best results by far with PhantomJS. There is a NuGet package for it, and combining it with HtmlAgilityPack makes for a pretty decent fetching & scraping toolkit.
This example just uses PhantomJS's built-in parsing capabilities. It worked with a very old version of the library; since the library still seems to be under active development, it is safe to assume that even more capabilities have been added since.
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.PhantomJS;

void Test()
{
    var linkText = @"Help Spread DuckDuckGo!";
    Console.WriteLine(GetHyperlinkUrl("https://duckduckgo.com", linkText));
    // as of right now, this would print 'https://duckduckgo.com/spread'
}

/// <summary>
/// Loads pageUrl, finds a hyperlink containing searchLinkText, returns
/// its URL if found, otherwise an empty string.
/// </summary>
public string GetHyperlinkUrl(string pageUrl, string searchLinkText)
{
    using (IWebDriver phantom = new PhantomJSDriver())
    {
        phantom.Navigate().GoToUrl(pageUrl);

        // FindElement throws if nothing matches, so use FindElements
        // and check whether any link was found.
        var links = phantom.FindElements(By.PartialLinkText(searchLinkText));
        if (links.Count > 0)
            return links[0].GetAttribute("href");
    }
    return string.Empty;
}
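To combine PhantomJS with HtmlAgilityPack as mentioned above, a minimal sketch might look like the following: PhantomJS renders the JavaScript-heavy page, and HtmlAgilityPack queries the rendered markup. This assumes the Selenium.WebDriver, PhantomJS, and HtmlAgilityPack NuGet packages (and an older Selenium release that still ships PhantomJSDriver); the class name, method name, and XPath query are only illustrative.

using System;
using HtmlAgilityPack;
using OpenQA.Selenium;
using OpenQA.Selenium.PhantomJS;

public static class RenderedPageScraper
{
    /// <summary>
    /// Renders pageUrl with PhantomJS, then parses the rendered DOM with
    /// HtmlAgilityPack and prints every hyperlink's text and target.
    /// </summary>
    public static void PrintLinks(string pageUrl)
    {
        using (IWebDriver phantom = new PhantomJSDriver())
        {
            phantom.Navigate().GoToUrl(pageUrl);

            // Feed the fully rendered markup (after JavaScript execution)
            // into HtmlAgilityPack for offline querying.
            var doc = new HtmlDocument();
            doc.LoadHtml(phantom.PageSource);

            var links = doc.DocumentNode.SelectNodes("//a[@href]");
            if (links == null)
                return;

            foreach (var link in links)
                Console.WriteLine("{0} -> {1}",
                    link.InnerText.Trim(),
                    link.GetAttributeValue("href", string.Empty));
        }
    }
}

The idea is to let PhantomJS do only what a headless browser is needed for (executing JavaScript and producing the final DOM) and to do all further extraction with HtmlAgilityPack, which is faster and easier to unit test than driving the browser for every query.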