
A web crawler in Python. Where should I start and what should I follow? - Help needed

I have intermediate knowledge of Python. If I have to write a web crawler in Python, what should I follow and where should I begin? Is there a specific tutorial? Any advice would be a great help. Thanks.

The Learner asked Jul 29 '10 05:07




2 Answers

I strongly recommend taking a look at Scrapy. It works with BeautifulSoup or any other HTML parser you prefer. I personally use it with lxml.html.

Out of the box, you get several things for free:

  • Concurrent requests, thanks to Twisted
  • CrawlSpider objects, which recursively follow links across the whole site
  • Great separation of data extraction & processing, which makes the most of the parallel processing capabilities
Tim McNamara answered Oct 05 '22 04:10


You will certainly need an HTML parsing library, and BeautifulSoup is a good choice. You can find plenty of samples and tutorials for fetching URLs and processing the returned HTML on the official page: http://www.crummy.com/software/BeautifulSoup/
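
As a small sketch of the parsing step, the snippet below pulls the links out of a page, which is the part a crawler needs in order to find the next URLs to visit. It assumes Beautiful Soup 4 (the `bs4` package); the `extract_links` helper and the sample HTML are made up for illustration, and the HTML is an inline string so the example runs without any network access.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return absolute URLs for every <a> tag that has an href attribute."""
    soup = BeautifulSoup(html, "html.parser")  # parser from the standard library
    links = []
    for anchor in soup.find_all("a", href=True):
        # Resolve relative links such as "/about" against the page URL.
        links.append(urljoin(base_url, anchor["href"]))
    return links


# Hypothetical page content; a real crawler would download this instead.
sample_html = """
<html><body>
  <a href="/about">About</a>
  <a href="https://example.org/docs">Docs</a>
  <a name="no-href">Skipped</a>
</body></html>
"""

links = extract_links(sample_html, "https://example.com/index.html")
print(links)
```

In a real crawler you would fetch each page first (for example with `urllib.request.urlopen`), pass the response body to a helper like this, and queue any links you have not visited yet.
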

Giljed Jowes answered Oct 05 '22 03:10