I'd like to perform data mining on a large scale. For this, I need a fast crawler. All I need is something to download a web page, extract links and follow them recursively, but without visiting the same URL twice. Basically, I want to avoid looping.
I already wrote a crawler in Python, but it's too slow. I'm not able to saturate a 100 Mbit line with it. Top speed is ~40 URLs/sec, and for some reason it's hard to get better results. It seems like a problem with Python's multithreading/sockets. I also ran into problems with Python's garbage collector, but that was solvable. CPU isn't the bottleneck, by the way.
So, what should I use to write a crawler that is as fast as possible, and what's the best solution to avoid looping while crawling?
EDIT:
The solution was to combine the multiprocessing and threading modules. Spawn multiple processes with multiple threads per process for best effect. Spawning multiple threads in a single process is not effective, and multiple processes with just one thread consume too much memory.
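For illustration, here is a minimal sketch of that processes-plus-threads layout, assuming the parent process deduplicates URLs and feeds a shared queue; the worker/thread counts, queue names and seed URL are placeholders, and link extraction/re-queueing is left out:

```python
import multiprocessing as mp
import threading
import urllib.request

WORKERS = 4              # number of processes (illustrative)
THREADS_PER_WORKER = 16  # threads per process (illustrative)

def fetch_loop(url_queue, result_queue):
    # Each thread pulls URLs until it receives a None sentinel.
    while True:
        url = url_queue.get()
        if url is None:
            break
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                result_queue.put((url, resp.read()))
        except Exception:
            result_queue.put((url, None))

def worker(url_queue, result_queue):
    # One process = a bundle of downloader threads sharing the same queues.
    threads = [threading.Thread(target=fetch_loop, args=(url_queue, result_queue))
               for _ in range(THREADS_PER_WORKER)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

if __name__ == "__main__":
    url_queue, result_queue = mp.Queue(), mp.Queue()
    seen = set()                               # URL dedup lives in the parent
    seeds = ["https://example.com/"]           # placeholder seed list

    procs = [mp.Process(target=worker, args=(url_queue, result_queue))
             for _ in range(WORKERS)]
    for p in procs:
        p.start()

    for url in seeds:
        if url not in seen:
            seen.add(url)
            url_queue.put(url)

    for _ in range(len(seen)):
        url, body = result_queue.get()
        # A real crawler would extract links here and queue only unseen ones.
        print(url, 0 if body is None else len(body))

    for _ in range(WORKERS * THREADS_PER_WORKER):  # sentinels stop every thread
        url_queue.put(None)
    for p in procs:
        p.join()
```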
Why not use something already tested for crawling, like Scrapy? I managed to reach almost 100 pages per second on a low-end VPS with limited RAM (about 400 MB), while network throughput was around 6-7 Mb/s (i.e. below 100 Mbit).
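As a rough starting point, a minimal Scrapy spider that follows links recursively could look like the sketch below; Scrapy's scheduler filters duplicate requests by default, so looping is already handled, and the spider name, domain and seed URL are placeholders:

```python
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class MiningSpider(CrawlSpider):
    name = "mining"                          # placeholder spider name
    allowed_domains = ["example.com"]        # placeholder domain
    start_urls = ["https://example.com/"]    # placeholder seed

    # Follow every extracted link; duplicate URLs are dropped by the scheduler.
    rules = (Rule(LinkExtractor(), callback="parse_page", follow=True),)

    def parse_page(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}
```

Assuming it is saved as myspider.py, you can run it with scrapy runspider myspider.py -o pages.jl and raise CONCURRENT_REQUESTS in the settings if the defaults don't saturate your link.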
Another improvement you can make is to use urllib3 (especially when crawling many pages from a single domain). Here's a brief comparison I did some time ago:
Scrapy now uses the Requests library, which in turn uses urllib3. That makes Scrapy the absolute go-to tool when it comes to scraping. Recent versions also support deploying projects, so scraping from a VPS is easier than ever.
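To illustrate the urllib3 suggestion above, here is a small sketch using a PoolManager, which keeps connections open so repeated requests to the same host reuse sockets instead of reconnecting; the host, paths and pool size are made up:

```python
import urllib3

# One PoolManager reuses TCP connections, so repeated requests to the same
# host skip the connect/handshake overhead on every fetch.
http = urllib3.PoolManager(maxsize=10)   # illustrative pool size

for path in ("/", "/about", "/contact"):  # hypothetical paths on one domain
    resp = http.request("GET", "https://example.com" + path)
    print(resp.status, len(resp.data))
```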
Around 2 years ago I developed a crawler that could download almost 250 URLs per second. You could follow my steps.
Distribute your web crawler work across separate tasks and process them in stages (a minimal sketch follows the list below):
a. downloader: fetches the page for a given URL
b. link extractor: pulls links out of downloaded pages
c. URLSeen: tracks URLs that have already been scheduled, so the same URL isn't visited twice
d. ContentSeen: tracks page content that has already been processed (e.g. by hash), so duplicate pages are skipped
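A minimal single-process sketch of how those four components could fit together is below; the answer ran them as separate distributed tasks, so treat this only as an outline of the data flow, with illustrative names and limits throughout:

```python
import hashlib
import queue
import re
import urllib.request

frontier = queue.Queue()   # URLs waiting for the downloader
url_seen = set()           # URLSeen: URLs already scheduled (prevents loops)
content_seen = set()       # ContentSeen: hashes of pages already processed

def downloader(url):
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read()

def link_extractor(body):
    # Crude href extraction; a real crawler would use an HTML parser.
    return re.findall(rb'href="(https?://[^"]+)"', body)

def crawl(seed, max_pages=100):
    frontier.put(seed)
    url_seen.add(seed)
    pages = 0
    while pages < max_pages and not frontier.empty():
        url = frontier.get()
        try:
            body = downloader(url)
        except Exception:
            continue
        digest = hashlib.sha1(body).hexdigest()
        if digest in content_seen:        # duplicate content, skip it
            continue
        content_seen.add(digest)
        pages += 1
        for link in link_extractor(body):
            link = link.decode("ascii", "ignore")
            if link not in url_seen:      # URLSeen check avoids revisits
                url_seen.add(link)
                frontier.put(link)

crawl("https://example.com/")             # placeholder seed
```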