I am trying to scrape a site using Scrapy.
This is the code I have written so far, based on http://thuongnh.com/building-a-web-crawler-with-scrapy/ (the original code does not work at all, so I tried to rebuild it):
from scrapy.spiders import Spider, CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from nettuts.items import NettutsItem


class MySpider(Spider):
    name = "nettuts"
    allowed_domains = ["net.tutsplus.com"]
    start_urls = ["http://code.tutsplus.com/posts?"]
    rules = [Rule(LinkExtractor(allow=('')), callback='parse', follow=True)]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.xpath('//li[@class="posts__post"]/a/text()').extract()
        for title in titles:
            item = NettutsItem()
            item["title"] = title
            yield item
The problem is that the crawler goes to the start page but does not scrape any pages after that.
The following can be a good place to start:
There can be two use cases for crawling a site recursively with Scrapy.
A) We just want to move across the website using, say, the pagination buttons of a table and fetch data. This is relatively straightforward.
import scrapy


class TrainSpider(scrapy.Spider):
    name = "trip"
    start_urls = ['somewebsite']

    def parse(self, response):
        '''do something with this parser'''
        next_page = response.xpath("//a[@class='next_page']/@href").extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
Observe the last four lines: the parse callback finds the 'next_page' link, joins it into an absolute URL with response.urljoin, and yields a new Request back to itself, so the spider keeps walking through the pages.
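For instance, here is a minimal sketch of what the "do something with this parser" part could contain; the table-row XPath and the 'title' field are assumptions for illustration, not from the original answer:

import scrapy


class TrainSpider(scrapy.Spider):
    name = "trip"
    start_urls = ['somewebsite']

    def parse(self, response):
        # Hypothetical extraction: pull one field from each row of the table.
        for row in response.xpath("//table//tr"):
            yield {"title": row.xpath("./td[1]//text()").extract_first()}

        # Then follow the pagination link and re-enter this same callback.
        next_page = response.xpath("//a[@class='next_page']/@href").extract_first()
        if next_page is not None:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)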
B) Not only do we want to move across pages, but we also want to extract data from one or more links on each page.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class StationDetailSpider(CrawlSpider):
    name = 'train'
    start_urls = ['someOtherWebsite']

    rules = (
        Rule(LinkExtractor(restrict_xpaths="//a[@class='next_page']"), follow=True),
        Rule(LinkExtractor(allow=r"/trains/\d+$"), callback='parse_trains'),
    )

    def parse_trains(self, response):
        '''do your parsing here'''
Over here, observe that:
We are using 'CrawlSpider', a subclass of the 'scrapy.Spider' parent class.
We have set two 'Rules':
a) The first rule just checks if there is a 'next_page' available and follows it.
b) The second rule requests all the links on a page that match the format, say '/trains/12343', and then calls 'parse_trains' to perform the parsing operation.
Important: Note that we don't want to use the regular 'parse' method here, as we are using the 'CrawlSpider' subclass. This class uses the 'parse' method internally, so we don't want to override it. Just remember to name your callback method something other than 'parse'.
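As an illustration, here is a minimal sketch of the same spider with 'parse_trains' filled in; the XPaths and field names are assumptions, not from the original answer:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class StationDetailSpider(CrawlSpider):
    name = 'train'
    start_urls = ['someOtherWebsite']

    rules = (
        # Keep following the pagination link on every listing page.
        Rule(LinkExtractor(restrict_xpaths="//a[@class='next_page']"), follow=True),
        # Send every /trains/<id> detail page to the callback below.
        Rule(LinkExtractor(allow=r"/trains/\d+$"), callback='parse_trains'),
    )

    def parse_trains(self, response):
        # Hypothetical fields: adapt the XPaths to the real page structure.
        yield {
            "url": response.url,
            "name": response.xpath("//h1/text()").extract_first(),
        }

Running 'scrapy crawl train -o trains.json' would then collect the yielded dictionaries into a JSON file.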
The problem is which Spider class you are using as a base. scrapy.Spider is a simple spider that does not support rules and link extractors. Instead, use CrawlSpider:
class MySpider(CrawlSpider):
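For example, here is a minimal sketch of the spider from the question rebuilt on CrawlSpider. It keeps the original XPath and item, renames the callback so it does not clash with CrawlSpider's built-in parse, and (as an assumption on my part) broadens allowed_domains so that links under code.tutsplus.com are not filtered as offsite:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from nettuts.items import NettutsItem


class MySpider(CrawlSpider):
    name = "nettuts"
    # Broadened from "net.tutsplus.com" so followed links on
    # code.tutsplus.com are not dropped by the offsite filter (my assumption).
    allowed_domains = ["tutsplus.com"]
    start_urls = ["http://code.tutsplus.com/posts?"]

    # Follow every link; each downloaded page is passed to parse_item.
    # The callback must not be named 'parse' when subclassing CrawlSpider.
    rules = [Rule(LinkExtractor(), callback='parse_item', follow=True)]

    def parse_item(self, response):
        titles = response.xpath('//li[@class="posts__post"]/a/text()').extract()
        for title in titles:
            item = NettutsItem()
            item["title"] = title
            yield item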