I'm writing a spider in Python to crawl a site. Trouble is, I need to examine about 2.5 million pages, so I could really use some help making it optimized for speed.
What I need to do is examine the pages for a certain number, and if it is found to record the link to the page. The spider is very simple, it just needs to sort through a lot of pages.
I'm completely new to Python, but have used Java and C++ before. I have yet to start coding it, so any recommendations on libraries or frameworks to include would be great. Any optimization tips are also greatly appreciated.
You could use MapReduce like Google does, either via Hadoop (specifically with Python: 1 and 2), Disco, or Happy.
The traditional line of thought, is write your program in standard Python, if you find it is too slow, profile it, and optimize the specific slow spots. You can make these slow spots faster by dropping down to C, using C/C++ extensions or even ctypes.
If you are spidering just one site, consider using wget -r
(an example).
Where are you storing the results? You can use PiCloud's cloud library to parallelize your scraping easily across a cluster of servers.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With