Writing a Faster Python Spider

Question

I'm writing a spider in Python to crawl a site. Trouble is, I need to examine about 2.5 million pages, so I could really use some help making it optimized for speed.

What I need to do is examine the pages for a certain number, and if it is found to record the link to the page. The spider is very simple, it just needs to sort through a lot of pages.

I'm completely new to Python, but have used Java and C++ before. I have yet to start coding it, so any recommendations on libraries or frameworks to include would be great. Any optimization tips are also greatly appreciated.

John Paulett · Accepted Answer

You could use MapReduce like Google does, either via Hadoop (specifically with Python: 1 and 2), Disco, or Happy.

The traditional line of thought, is write your program in standard Python, if you find it is too slow, profile it, and optimize the specific slow spots. You can make these slow spots faster by dropping down to C, using C/C++ extensions or even ctypes.

If you are spidering just one site, consider using wget -r (an example).

BrainCore · Answer

Where are you storing the results? You can use PiCloud's cloud library to parallelize your scraping easily across a cluster of servers.

Writing a Faster Python Spider

Tags:

python

web-crawler

MMag

2 Answers

John Paulett

BrainCore

Recent Activity

Donate For Us

Writing a Faster Python Spider

Tags:

python

web-crawler

MMag

2 Answers

John Paulett

BrainCore

Related questions

Recent Activity

Donate For Us