I just had this thought and was wondering if it's possible to crawl the entire web (just like the big boys!) on a single dedicated server (say a Core2Duo, 8 GB RAM, 750 GB disk, 100 Mbps).
I've come across a paper where this was done, but I cannot recall its title. It was something about crawling the entire web on a single dedicated server using some statistical model.
Anyway, imagine starting with around 10,000 seed URLs and doing an exhaustive crawl...
Is it possible?
I need to crawl the web but am limited to a dedicated server. How can I do this? Is there an open-source solution out there already?
For example, see this real-time search engine: http://crawlrapidshare.com. The results are extremely good and freshly updated. How are they doing this?
Crawling the Web is conceptually simple. Treat the Web as a very complicated directed graph. Each page is a node. Each link is a directed edge.
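To make the graph traversal concrete, here is a minimal sketch in Python, assuming the third-party `requests` and `beautifulsoup4` packages. The seed list and page limit are placeholders; a real crawler would also need robots.txt handling, politeness delays, and persistent storage.

```python
# Minimal breadth-first crawl: pages are nodes, <a href> links are edges.
from collections import deque
from urllib.parse import urljoin, urldefrag

import requests
from bs4 import BeautifulSoup

def crawl(seed_urls, max_pages=1000):
    frontier = deque(seed_urls)     # edges not yet followed
    visited = set(frontier)         # nodes already discovered
    while frontier and len(visited) <= max_pages:
        url = frontier.popleft()
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue                # dead link: skip this edge
        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.find_all("a", href=True):
            # Resolve relative links and drop #fragments so the same
            # node isn't queued under several different spellings.
            link, _ = urldefrag(urljoin(url, a["href"]))
            if link.startswith("http") and link not in visited:
                visited.add(link)
                frontier.append(link)
    return visited
```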
You could start with the assumption that a single well-chosen starting point will eventually lead to every other point. This won't be strictly true, but in practice I think you'll find it's mostly true. Still, chances are you'll need multiple (maybe thousands of) starting points.
You will want to make sure you don't traverse the same page twice (within a single traversal). In practice the traversal will take so long that it's really a question of how long before you come back to a particular node, and also how you detect and deal with changes (meaning the second time you visit a page it may have changed).
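On the hardware in the question, the naive `visited` set above is the first thing that breaks: billions of full URL strings won't fit in 8 GB of RAM. One common workaround (my suggestion, not something from the answer above) is a Bloom filter: a fixed-size bit array plus k hash functions that answers "definitely new" or "probably seen", trading a small false-positive rate for a bounded memory footprint.

```python
# Sketch of a Bloom filter for URL dedup on limited RAM.
# Sizing is an assumption: 2**30 bits = 128 MiB; in practice you'd
# size it to the expected URL count and target false-positive rate.
import hashlib

class BloomFilter:
    def __init__(self, size_bits=2**30, num_hashes=7):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, url):
        # Derive k bit positions from one SHA-256 digest of the URL
        # (7 hashes x 4 bytes each fits in the 32-byte digest).
        digest = hashlib.sha256(url.encode()).digest()
        for i in range(self.num_hashes):
            chunk = digest[i * 4:(i + 1) * 4]
            yield int.from_bytes(chunk, "big") % self.size

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(url))

seen = BloomFilter()
if "http://example.com/" not in seen:
    seen.add("http://example.com/")
```

For the revisit problem (detecting changed pages), a bit set can't help; you'd keep a per-URL timestamp or content checksum on disk instead.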
The killer will be how much data you need to store and what you want to do with it once you've got it.
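A rough back-of-envelope calculation shows why. The average page size and compression ratio below are my assumptions, not figures from the question:

```python
# Back-of-envelope numbers for the hardware in the question.
disk_bytes = 750e9            # 750 GB disk
avg_page_bytes = 100e3        # ~100 KB of HTML per page (assumed)
compression = 5               # ~5:1 gzip on HTML (assumed)

pages_on_disk = disk_bytes / (avg_page_bytes / compression)
print(f"pages that fit on disk:  {pages_on_disk:,.0f}")   # ~37,500,000

link_bytes_per_day = 100e6 / 8 * 86400   # 100 Mbps, fully saturated
pages_per_day = link_bytes_per_day / avg_page_bytes
print(f"pages fetchable per day: {pages_per_day:,.0f}")   # ~10,800,000
```

So even saturating the link around the clock, you fetch on the order of ten million pages a day and can store a few tens of millions compressed, against a web of billions of pages. An "exhaustive" crawl on this box is really a prioritization and storage problem, not a CPU problem.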