
Guide on crawling the entire web?

Tags:

web-crawler

I just had this thought, and was wondering: is it possible to crawl the entire web (just like the big boys!) on a single dedicated server (e.g. Core2Duo, 8 GB RAM, 750 GB disk, 100 Mbps)?

I've come across a paper where this was done, but I cannot recall its title. It was about crawling the entire web on a single dedicated server using some statistical model.

Anyway, imagine starting with just around 10,000 seed URLs and doing an exhaustive crawl...

Is it possible?

I need to crawl the web, but I'm limited to a single dedicated server. How can I do this? Is there an open source solution out there already?

For example, see this real-time search engine: http://crawlrapidshare.com. The results are extremely good and freshly updated. How are they doing this?

asked Jan 17 '10 by bohohasdhfasdf


1 Answer

Crawling the Web is conceptually simple. Treat the Web as a very complicated directed graph. Each page is a node. Each link is a directed edge.

You could start with the assumption that a single well-chosen starting point will eventually lead to every other point. This won't be strictly true, but in practice I think you'll find it's mostly true. Still, chances are you'll need multiple (maybe thousands of) starting points.
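The traversal itself is just breadth-first search over the link graph. A minimal sketch, where an in-memory dict and a `fetch_links` helper stand in for real HTTP fetching and link extraction (all the URLs and the graph are hypothetical):

```python
from collections import deque

# Hypothetical in-memory link graph standing in for real HTTP fetches.
LINK_GRAPH = {
    "http://a.example": ["http://b.example", "http://c.example"],
    "http://b.example": ["http://c.example"],
    "http://c.example": ["http://a.example", "http://d.example"],
    "http://d.example": [],
    "http://island.example": ["http://d.example"],  # only reachable via its own seed
}

def fetch_links(url):
    """Stand-in for downloading a page and extracting its <a href> links."""
    return LINK_GRAPH.get(url, [])

def crawl(seeds):
    """Breadth-first traversal from multiple seed URLs, visiting each page once."""
    seen = set(seeds)
    frontier = deque(seeds)
    order = []
    while frontier:
        url = frontier.popleft()
        order.append(url)
        for link in fetch_links(url):
            if link not in seen:  # never enqueue the same page twice
                seen.add(link)
                frontier.append(link)
    return order
```

Note that starting from `http://a.example` alone never reaches `http://island.example`; adding it as a second seed covers the disconnected component, which is exactly why you need many starting points.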

You will want to make sure you don't traverse the same page twice (within a single traversal). In practice the traversal will take so long that it's really a question of how long before you come back to a particular node, and also how you detect and deal with changes (meaning that the second time you come to a page it may have changed).
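One common way to detect changes on a revisit is to store a hash of each page's content and compare it next time around. A sketch, using an illustrative in-memory store (a real crawler would persist this to disk or a database):

```python
import hashlib

page_hashes = {}  # url -> last-seen content hash (illustrative in-memory store)

def has_changed(url, content):
    """Return True if the page's content differs from what we saw last time.

    A never-before-seen URL counts as changed, so it gets processed.
    """
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    changed = page_hashes.get(url) != digest
    page_hashes[url] = digest
    return changed
```

Comparing a 64-character digest is far cheaper than diffing the stored page body, which matters when you are revisiting millions of URLs.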

The killer will be how much data you need to store and what you want to do with it once you've got it.
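To put rough numbers on that, assuming an average of ~100 KB of raw HTML per page (a guess for illustration, not a measured figure), the 750 GB disk from the question tops out in the single-digit millions of pages:

```python
DISK_BYTES = 750 * 10**9       # the 750 GB disk mentioned in the question
AVG_PAGE_BYTES = 100 * 10**3   # assumed ~100 KB of HTML per page (illustrative)

pages_storable = DISK_BYTES // AVG_PAGE_BYTES
print(pages_storable)  # 7500000 -> about 7.5 million pages, before compression or indexes
```

Compression would stretch that several-fold, but it's still nowhere near the billions of pages on the web, so on one server you'd be crawling a sample, not the whole thing.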

answered Sep 22 '22 by cletus