My first question here :)
I was trying to crawl my school's website for all the webpages it contains, but I cannot get the links into a text file. I have the right permissions, so that is not the problem.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
from scrapy.spider import BaseSpider

class hsleidenSpider(CrawlSpider):
    name = "hsleiden1"
    allowed_domains = ["hsleiden.nl"]
    start_urls = ["http://hsleiden.nl"]

    # allow=() is used to match all links
    rules = [
        Rule(SgmlLinkExtractor(allow=()), follow=True),
        Rule(SgmlLinkExtractor(allow=()), callback='parse_item')
    ]

    def parse_item(self, response):
        x = HtmlXPathSelector(response)
        filename = "hsleiden-output.txt"
        open(filename, 'ab').write(response.url)
So I am only scanning within the hsleiden.nl domain, and I would like to write response.url to the text file hsleiden-output.txt.
Is there any way to do this right?
As the documentation for CrawlSpider notes, if multiple rules match the same link, only the first will be used.
Thus, as a result of redirects, the first rule produces a seemingly infinite loop, and since the second rule is never applied, no matching links are ever passed to the parse_item callback, so no output file is written.
Some investigation is required to fix the redirect issue (and to modify the first rule so that it doesn't clash with the second), but commenting it out entirely will produce an output file of links like so:
http://www.hsleiden.nl/activiteitenkalenderhttp://www.hsleiden.nlhttp://www.hsleiden.nl/vind-je-studie/proefstuderenhttp://www.hsleiden.nl/studiumgenerale
etc
They were all munged together on a single line, so you might want to add a newline character or separator each time you write to the output file.
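For example, here is a minimal, untested sketch that keeps your imports: it merges the two rules into a single Rule (a Rule can both follow links and pass each response to a callback, so the rules no longer compete for the same links) and appends a newline after each URL:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class hsleidenSpider(CrawlSpider):
    name = "hsleiden1"
    allowed_domains = ["hsleiden.nl"]
    start_urls = ["http://hsleiden.nl"]

    # One rule that both follows links and fires the callback.
    rules = [
        Rule(SgmlLinkExtractor(allow=()), callback='parse_item', follow=True),
    ]

    def parse_item(self, response):
        filename = "hsleiden-output.txt"
        # Append the URL plus a newline so each link lands on its own line.
        with open(filename, 'ab') as f:
            f.write(response.url + "\n")

This doesn't address the redirect loop itself, but it gets each crawled URL onto its own line in the output file.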