
Fast internet crawler

I'd like to perform data mining on a large scale. For this, I need a fast crawler. All I need is something to download a web page, extract links and follow them recursively, but without visiting the same URL twice. Basically, I want to avoid looping.

I already wrote a crawler in Python, but it's too slow: I'm not able to saturate a 100 Mbit line with it. Top speed is ~40 URLs/sec, and for some reason it's hard to do better. It seems to be a problem with Python's multithreading/sockets. I also ran into problems with Python's garbage collector, but that was solvable. CPU isn't the bottleneck, by the way.

So, what should I use to write a crawler that is as fast as possible, and what's the best solution to avoid looping while crawling?

EDIT: The solution was to combine the multiprocessing and threading modules: spawn multiple processes with multiple threads per process. Spawning multiple threads in a single process is not effective, and multiple processes with just one thread each consume too much memory.
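As a rough illustration of that layout (a sketch added here, not the asker's actual code, with made-up process/thread counts and a placeholder seed URL):

```python
# Sketch only: a few worker processes, each running several downloader threads
# that pull URLs from a shared queue. Counts and the seed URL are placeholders.
import multiprocessing
import threading
import urllib.request

N_PROCESSES = 4
N_THREADS = 16   # per process

def downloader(url_queue):
    while True:
        url = url_queue.get()
        if url is None:              # poison pill: this thread is done
            return
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                resp.read()          # real code would extract links here
        except Exception:
            pass                     # real code would log and maybe retry

def worker_process(url_queue):
    threads = [threading.Thread(target=downloader, args=(url_queue,))
               for _ in range(N_THREADS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

if __name__ == "__main__":
    url_queue = multiprocessing.Queue()
    url_queue.put("http://example.com/")          # seed URL (placeholder)
    for _ in range(N_PROCESSES * N_THREADS):      # one poison pill per thread
        url_queue.put(None)
    procs = [multiprocessing.Process(target=worker_process, args=(url_queue,))
             for _ in range(N_PROCESSES)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```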

asked Oct 04 '11 by pbp


2 Answers

Why not use something already tested for crawling, like Scrapy? I managed to reach almost 100 pages per second on a low-end VPS with limited RAM (about 400 MB), while network throughput was around 6-7 Mb/s (i.e., below 100 Mbit/s).
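For reference, a minimal link-following spider looks roughly like the sketch below (the spider name and start URL are placeholders); Scrapy's scheduler filters duplicate requests by default, which also covers the "avoid looping" requirement.

```python
# Minimal Scrapy spider sketch; the name and start URL are placeholders.
import scrapy

class FastSpider(scrapy.Spider):
    name = "fast_spider"
    start_urls = ["http://example.com/"]

    def parse(self, response):
        # Scrapy deduplicates requests by default, so the same URL
        # is not scheduled twice.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```

Throughput is tuned with settings such as CONCURRENT_REQUESTS, and a single-file spider like this can be run with `scrapy runspider`.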

Another improvement is to use urllib3 (especially when crawling many pages from a single domain). Here's a brief comparison I did some time ago:

urllib benchmark
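The main reason urllib3 helps when many pages come from one host is connection pooling (keep-alive), so each request doesn't pay for a new TCP handshake. A minimal sketch, with placeholder URLs:

```python
# urllib3 keeps connections to the same host alive in a pool, so repeated
# requests skip the TCP handshake. Host and paths are placeholders.
import urllib3

http = urllib3.PoolManager(maxsize=10)   # up to 10 pooled connections per host

for path in ("/page1", "/page2", "/page3"):
    resp = http.request("GET", "http://example.com" + path)
    html = resp.data                     # page body; extract links from this
```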

UPDATE:

Scrapy now uses the Requests library, which in turn uses urllib3. That makes Scrapy the absolute go-to tool when it comes to scraping. Recent versions also support deploying projects, so scraping from a VPS is easier than ever.

answered by Attila O.


Around two years ago I developed a crawler that could download almost 250 URLs per second. You could follow my steps:

  1. Optimize your use of file handles. Try to keep as few files open as possible.
  2. Don't write your data to disk after every page. Buffer it and dump it after around 5,000 or 10,000 URLs (see the sketch after this list).
  3. For robustness you don't need a separate configuration. Use a log file, and when you want to resume, just read the log file and restart the crawler from where it left off.
  4. Split the crawler into separate tasks and process them in intervals (a sketch of the URLSeen part follows this list):

    a. downloader

    b. link extractor

    c. URLSeen

    d. ContentSeen
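A rough sketch of points 2 and 4 above, assuming an in-memory set for URLSeen and a simple buffered writer (the class names and parameters are made up for illustration):

```python
# Illustrative sketch of URLSeen (loop avoidance) plus batched output.
# An in-memory set works for moderate crawls; very large crawls would need
# a Bloom filter or an on-disk store instead.
import threading

class URLSeen:
    """Thread-safe check for 'have we already queued/visited this URL?'."""
    def __init__(self):
        self._seen = set()
        self._lock = threading.Lock()

    def add_if_new(self, url):
        with self._lock:
            if url in self._seen:
                return False
            self._seen.add(url)
            return True

class BatchWriter:
    """Buffer crawled records and flush them to disk in large chunks."""
    def __init__(self, path, flush_every=5000):
        self._path = path
        self._flush_every = flush_every
        self._buffer = []

    def add(self, record):
        self._buffer.append(record)
        if len(self._buffer) >= self._flush_every:
            self.flush()

    def flush(self):
        if not self._buffer:
            return
        with open(self._path, "a", encoding="utf-8") as f:
            f.write("\n".join(self._buffer) + "\n")
        self._buffer.clear()
```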

answered by Mohiul Alam