
Writing a Faster Python Spider

I'm writing a spider in Python to crawl a site. The trouble is, I need to examine about 2.5 million pages, so I could really use some help optimizing it for speed.

What I need to do is examine each page for a certain number and, if it is found, record the link to the page. The spider is very simple; it just needs to sort through a lot of pages.

I'm completely new to Python, but have used Java and C++ before. I have yet to start coding it, so any recommendations on libraries or frameworks to include would be great. Any optimization tips are also greatly appreciated.

asked Dec 05 '09 by MMag

2 Answers

You could use MapReduce like Google does, via Hadoop (which can run Python jobs), Disco, or Happy.
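
To make the MapReduce fit concrete: the crawl decomposes naturally, with the map step fetching one page and emitting its URL if the number is present, and the reduce step just concatenating the hits. A framework-neutral sketch of that split, where fetch() and TARGET are hypothetical placeholders for a page fetcher and the number being sought:

    # Framework-neutral sketch of the MapReduce decomposition.
    # The framework applies map_page to each URL in parallel and
    # reduce_hits merges the per-page results.
    def map_page(url):
        page = fetch(url)  # placeholder: returns the page text
        return [url] if TARGET in page else []

    def reduce_hits(partials):
        # Flatten the per-page lists into one list of matching links.
        return [url for part in partials for url in part]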

The traditional line of thought is: write your program in standard Python, and if you find it is too slow, profile it and optimize the specific slow spots. You can make these hot spots faster by dropping down to C, using C/C++ extensions or ctypes.
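
As a concrete starting point, here is a minimal single-threaded sketch of that plain-Python approach, assuming Python 3's standard library and that the number appears literally in the page text; TARGET and the URL list are placeholders:

    import re
    import urllib.request

    TARGET = re.compile(r"\b1234567\b")  # hypothetical number to find

    def scan(urls):
        hits = []
        for url in urls:
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    page = resp.read().decode("utf-8", errors="replace")
            except OSError:
                continue  # skip unreachable or misbehaving pages
            if TARGET.search(page):
                hits.append(url)  # record the link to the page
        return hits

Run it under python -m cProfile to find the slow spots first; on 2.5 million pages the time will almost certainly be dominated by network I/O rather than Python itself, so overlapping requests usually buys more than dropping to C.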

If you are spidering just one site, consider using wget -r.
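
If you take the wget -r route, the crawl and the search separate cleanly: wget mirrors the site to disk, and a short script scans the mirror afterwards. A sketch, assuming the mirror lives under ./mirror and TARGET is the number you are after:

    import os

    TARGET = "1234567"  # hypothetical number to find

    matches = []
    # Walk the directory tree that wget -r produced and record
    # which files contain the target number.
    for root, _dirs, files in os.walk("mirror"):
        for name in files:
            path = os.path.join(root, name)
            with open(path, encoding="utf-8", errors="replace") as f:
                if TARGET in f.read():
                    matches.append(path)

Because wget mirrors the site's directory layout, each matching path maps straight back to the URL it came from.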

answered by John Paulett


Where are you storing the results? You can use PiCloud's cloud library to parallelize your scraping easily across a cluster of servers.
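
I can't vouch for PiCloud's exact API from memory, so as a stand-in here is the same parallel-map pattern using the standard library's multiprocessing; a cluster library like PiCloud's applied the equivalent map across remote workers instead of local processes. TARGET and the URL list are placeholders:

    from multiprocessing import Pool
    import urllib.request

    TARGET = "1234567"  # hypothetical number to find

    def scan_one(url):
        # Fetch one page; return the URL if the number appears.
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                page = resp.read().decode("utf-8", errors="replace")
        except OSError:
            return None
        return url if TARGET in page else None

    if __name__ == "__main__":
        urls = ["http://example.com/page1"]  # placeholder URL list
        # Apply scan_one to every URL across a pool of workers.
        with Pool(processes=32) as pool:
            hits = [u for u in pool.map(scan_one, urls) if u]

pool.map preserves input order and blocks until every job finishes, which keeps the bookkeeping trivial.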

answered by BrainCore