How can I implement Google-like recrawling in my application (web or console)? I only want to recrawl pages that have been updated after a particular date.
The LastModified property on System.Net.HttpWebResponse gives only the current date of the server. For example, if I download a page with HttpWebRequest on 27 January 2012 and check the LastModified value, it shows the server's current time at the moment the page was served, which in this case is 27 January 2012.
Can anyone suggest any other methods?
The crawl rate is how many requests per second Googlebot makes to your site while crawling it: for example, 5 requests per second. You cannot change how often Google crawls your site, but if you want Google to crawl new or updated content on your site, you can request a recrawl.

Most of Google's Search index is built through the work of software known as crawlers. These automatically visit publicly accessible webpages and follow links on those pages, much as you would if you were browsing content on the web.

Crawling is the process of finding new or updated pages to add to Google's index; one of Google's crawling engines crawls (requests) each page. The terms "crawl" and "index" are often used interchangeably, although they are different (but closely related) actions.
First, what you're trying to do is very difficult, and there are a number of research-level papers that try to address it (I will link a few of them a little later). There is no way to tell whether a site has changed without crawling it, although there are shortcuts, such as checking the Content-Length from the response headers without downloading the rest of the page (see the sketch below). This lets your system save on traffic, but it won't resolve your problem in a manner that's really useful.
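As a rough illustration of that shortcut, here is a minimal C# sketch (not code from the answer) that issues a HEAD request, so only the headers are transferred, and compares the reported Content-Length against a value stored from a previous crawl. The URL and the stored length are placeholders, and many servers omit or misreport Content-Length for dynamic pages.

```csharp
using System;
using System.Net;

class HeadCheck
{
    static void Main()
    {
        var request = (HttpWebRequest)WebRequest.Create("http://example.com/page.html");
        request.Method = "HEAD"; // headers only, no body is downloaded

        using (var response = (HttpWebResponse)request.GetResponse())
        {
            long previousLength = 12345;                  // hypothetical value saved from the last crawl
            long currentLength = response.ContentLength;  // -1 if the server does not send the header

            if (currentLength != -1 && currentLength != previousLength)
                Console.WriteLine("Length changed - schedule a recrawl.");
            else
                Console.WriteLine("No change detected (or header missing) - skip for now.");
        }
    }
}
```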
Second, since you're concerned about content, the Last-Modified header field will not be very useful to you; I would even go as far as to say that it will not be useful at all, because many servers simply report the time the page was served (which is exactly what you observed).
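For completeness, if a server did maintain Last-Modified correctly, the standard way to exploit it is a conditional GET: send If-Modified-Since and treat a 304 Not Modified response as "nothing changed". Below is a minimal sketch of that pattern; the URL and the date are placeholders, and in your case it will mostly come back 200 because the header is regenerated on every request.

```csharp
using System;
using System.Net;

class ConditionalGet
{
    static void Main()
    {
        var request = (HttpWebRequest)WebRequest.Create("http://example.com/page.html");
        request.IfModifiedSince = new DateTime(2012, 1, 27); // hypothetical time of your last crawl

        try
        {
            using (var response = (HttpWebResponse)request.GetResponse())
            {
                // 200 OK: the server claims the page changed since the given date
                Console.WriteLine("Changed, Last-Modified: " + response.LastModified);
            }
        }
        catch (WebException ex)
        {
            var response = ex.Response as HttpWebResponse;
            if (response != null && response.StatusCode == HttpStatusCode.NotModified)
                Console.WriteLine("304 Not Modified - nothing to recrawl."); // server says no change
            else
                throw;
        }
    }
}
```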
And third, what you're describing has somewhat conflicting requirements, because you're interested in crawling only the pages that have updated content, and that's not exactly how Google does things (even though you asked for Google-like crawling). Google's crawling is focused on providing the freshest content for the most frequently searched/visited websites. For example: Google has very little interest in frequently crawling a website that updates its content twice a day if that website has 10 visitors a day; it is far more interested in crawling a website that gets 10 million visitors a day, even if its content updates less frequently. It may also be true that websites which update their content frequently tend to have a lot of visitors, but from Google's perspective that's not really the point.
If you have to discover new websites (coverage) and at the same time you want the latest content of the sites you already know about (freshness), then you have conflicting goals (as do most crawlers, including Google). Usually what ends up happening is that more coverage means less freshness, and more freshness means less coverage. If you're interested in balancing both, I suggest you read the following articles:
The summary of the idea is that you have to crawl a website several times (maybe several hundred times) in order to build up a good measure of its change history. Once you have a good set of historical measurements, you use a predictive model to estimate when the website will change again, and you schedule a crawl for some time after the expected change (a simple sketch of this follows).
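As a very rough, hypothetical illustration of that scheduling idea (the research papers use far more sophisticated models), here is a sketch that estimates a page's average change interval from the times at which changes were observed and schedules the next crawl just after the next expected change. The history data and the fallback interval are made up.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class RecrawlScheduler
{
    // Given the times at which a page was observed to have changed,
    // predict the next change as "last change + average interval between changes".
    static DateTime PredictNextCrawl(List<DateTime> observedChanges)
    {
        if (observedChanges.Count < 2)
            return DateTime.UtcNow.AddDays(1); // not enough history: fall back to a default interval

        var intervals = observedChanges.Zip(observedChanges.Skip(1),
                                            (a, b) => (b - a).TotalHours);
        double avgHours = intervals.Average();

        return observedChanges.Last().AddHours(avgHours);
    }

    static void Main()
    {
        // Hypothetical history: the page changed roughly every two days.
        var history = new List<DateTime>
        {
            new DateTime(2012, 1, 20), new DateTime(2012, 1, 22), new DateTime(2012, 1, 24)
        };

        Console.WriteLine("Next crawl scheduled for: " + PredictNextCrawl(history));
    }
}
```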