
Strategy for how to crawl/index frequently updated webpages?


I'm trying to build a very small, niche search engine, using Nutch to crawl specific sites. Some of the sites are news/blog sites. If I crawl, say, techcrunch.com, and store and index its front page or any of its main pages, then within hours my index for that page will be out of date.

Does a large search engine such as Google have an algorithm to re-crawl frequently updated pages very frequently, hourly even? Or does it just score frequently updated pages very low so they don't get returned?

How can I handle this in my own application?



2 Answers

Good question. This is actually an active topic in the WWW research community. The techniques involved are called re-crawl strategies or page refresh policies.

As far as I know, three different factors are considered in the literature:

  • Change frequency (how often the content of a web page is updated)
    • [1]: Formalizes the notion of "freshness" of data and uses a Poisson process to model how web pages change.
    • [2]: Proposes a frequency estimator.
    • [3]: More of a scheduling policy.
  • Relevance (how much influence the updated page content has on search results)
    • [4]: Maximizes the quality of the user experience for those who query the search engine.
    • [5]: Determines the (nearly) optimal crawling frequencies.
  • Information longevity (the lifetimes of content fragments that appear and disappear from web pages over time; this has been shown not to be strongly correlated with change frequency)
    • [6]: Distinguishes between ephemeral and persistent content.

You might want to decide which factor is more important for your application and users. Then you can check the references below for more details.


Edit: I briefly discuss the frequency estimator mentioned in [2] to get you started. Based on this, you should be able to figure out what might be useful to you in the other papers. :)

Please read the paper in the order I point out below. It should not be too hard to understand as long as you know some basic probability and statistics (maybe much less if you just take the estimator formula):

Step 1. Please go to Section 6.4 -- Application to a Web crawler. Here Cho lists three approaches to estimating web page change frequency:

  • Uniform policy: A crawler revisits all pages at the frequency of once every week.
  • Naive policy: For the first 5 visits, the crawler visits each page once every week. After those 5 visits, it estimates the change frequency of each page using the naive estimator (Section 4.1).
  • Our policy: The crawler uses the proposed estimator (Section 4.2) to estimate change frequency.

Step 2. The naive policy. Please go to section 4. You will read:

Intuitively, we may use X/T (X: the number of detected changes, T: monitoring period) as the estimated frequency of change.
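
To make that concrete, here is a minimal sketch of the naive estimator (my own code, not the paper's; the class and parameter names are made up):

```java
// A minimal sketch of the naive estimator from Section 4.1:
// estimated change frequency = X / T.
public class NaiveEstimator {

    // detectedChanges = X, the number of visits on which a change was detected
    // monitoringDays  = T, the length of the monitoring period in days
    // returns the estimated number of changes per day
    static double estimate(int detectedChanges, double monitoringDays) {
        return detectedChanges / monitoringDays;
    }

    public static void main(String[] args) {
        // 6 detected changes over 10 days of daily visits -> 0.6 changes/day
        System.out.println(estimate(6, 10.0));
    }
}
```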

The subsequent Section 4.1 then proves that this estimator is biased, inconsistent, and inefficient.

Step 3. The improved estimator. Please go to Section 4.2. The new estimator is:

\hat{r} = -\log\left(\frac{\bar X + 0.5}{n + 0.5}\right)

where \bar X is n - X (the number of accesses on which the element did not change) and n is the total number of accesses. So just take this formula and estimate the change frequency. You don't need to understand the proof in the rest of the subsection.
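
Plugged into code, the same example looks like this (again my own sketch, under the assumption that the formula above is transcribed correctly; class and parameter names are made up):

```java
// A sketch of the improved estimator from Section 4.2, assuming the formula
// quoted above: r = -log((X_bar + 0.5) / (n + 0.5)), with X_bar = n - X.
public class ImprovedEstimator {

    // n               = number of accesses (visits)
    // detectedChanges = X, the number of accesses on which a change was detected
    // returns the estimated changes per access interval (per day if you visit daily)
    static double estimate(int n, int detectedChanges) {
        double xBar = n - detectedChanges;          // accesses with no detected change
        return -Math.log((xBar + 0.5) / (n + 0.5));
    }

    public static void main(String[] args) {
        // Same example as before: 6 detected changes in 10 daily visits.
        // The naive estimate was 0.6 changes/day; this gives about 0.85,
        // because it accounts for changes missed between visits.
        System.out.println(estimate(10, 6));
    }
}
```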

Step 4. There are some tricks and useful techniques discussed in Section 4.3 and Section 5 that might be helpful to you. Section 4.3 discusses how to deal with irregular visit intervals. Section 5 answers the question: when the last-modification date of an element is available, how can we use it to estimate the change frequency? The paper gives the proposed last-modification-date estimator as an algorithm (see Fig. 10), and the explanation that follows Fig. 10 is very clear.

Step 5. If you're interested, take a look at the experimental setup and results in Section 6.

So that's it. If you feel more confident now, go ahead and try the freshness paper in [1].


References

[1] http://oak.cs.ucla.edu/~cho/papers/cho-tods03.pdf

[2] http://oak.cs.ucla.edu/~cho/papers/cho-freq.pdf

[3] http://hal.inria.fr/docs/00/07/33/72/PDF/RR-3317.pdf

[4] http://wwwconference.org/proceedings/www2005/docs/p401.pdf

[5] http://www.columbia.edu/~js1353/pubs/wolf-www02.pdf

[6] http://infolab.stanford.edu/~olston/publications/www08.pdf



Google's algorithms are mostly closed; they won't tell you how they do it.

I built a crawler using the concept of a directed graph and based the re-crawl rate on pages' degree centrality. You could consider a website to be a directed graph with pages as nodes and hyperlinks as edges. A node with high centrality will probably be a page that is updated more often. At least, that is the assumption.

This can be implemented by storing URLs and the links between them. If you crawl and don't throw away any links, the per-site graph will grow. Calculating the (normalised) in- and out-degree of every node in a site then gives you a measure of which pages are most worth re-crawling more often.
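
Here is a minimal sketch of that idea (my own illustration; the class name and URLs are made up, and a real crawler would persist the link graph and recompute the scores periodically):

```java
import java.util.*;

// Sketch of a degree-centrality-based re-crawl priority.
// Nodes are URLs, edges are hyperlinks discovered while crawling.
public class RecrawlPriority {
    private final Map<String, Set<String>> outLinks = new HashMap<>();
    private final Map<String, Set<String>> inLinks = new HashMap<>();

    // Record a hyperlink from -> to, creating both nodes if needed.
    void addLink(String from, String to) {
        outLinks.computeIfAbsent(from, k -> new HashSet<>()).add(to);
        inLinks.computeIfAbsent(to, k -> new HashSet<>()).add(from);
        outLinks.computeIfAbsent(to, k -> new HashSet<>());
        inLinks.computeIfAbsent(from, k -> new HashSet<>());
    }

    // Degree centrality: (in-degree + out-degree), normalised by the maximum
    // possible degree (n - 1 other nodes, counting each direction).
    Map<String, Double> centrality() {
        int n = outLinks.size();
        Map<String, Double> scores = new HashMap<>();
        for (String url : outLinks.keySet()) {
            int degree = inLinks.get(url).size() + outLinks.get(url).size();
            scores.put(url, n > 1 ? degree / (2.0 * (n - 1)) : 0.0);
        }
        return scores;
    }

    public static void main(String[] args) {
        RecrawlPriority g = new RecrawlPriority();
        // Hypothetical site: the front page links to articles, articles link back.
        g.addLink("/", "/article-1");
        g.addLink("/", "/article-2");
        g.addLink("/article-1", "/");
        g.addLink("/article-2", "/");
        // Print pages by descending centrality: the front page ranks first.
        g.centrality().entrySet().stream()
            .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
            .forEach(e -> System.out.println(e.getKey() + " -> " + e.getValue()));
    }
}
```

Pages with the highest scores (typically the front page and section pages) would then get the shortest re-crawl intervals.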
