
Trying to crawl all links of a webpage with Scrapy, but I cannot output the links on a page

My first question here :)

I was trying to crawl my school's website to find every page on it, but I cannot get the links into a text file. I have the right permissions, so that is not the problem.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
from scrapy.spider import BaseSpider

class hsleidenSpider(CrawlSpider):
        name = "hsleiden1"
        allowed_domains = ["hsleiden.nl"]
        start_urls = ["http://hsleiden.nl"]

        # allow=() is used to match all links
        rules = [
        Rule(SgmlLinkExtractor(allow=()), follow=True),
        Rule(SgmlLinkExtractor(allow=()), callback='parse_item')
        ]

        def parse_item(self, response):
                x = HtmlXPathSelector(response)

                filename = "hsleiden-output.txt"
                open(filename, 'ab').write(response.url)

So I am only crawling within the hsleiden.nl domain, and I would like each response.url to be written to the text file hsleiden-output.txt.

Is there any way to do this right?

Jasper Nugteren asked Nov 12 '22 15:11



1 Answer

According to the documentation for CrawlSpider, if multiple rules match the same link, only the first one (in the order the rules are defined) is used.

Both of your rules use allow=(), so every link matches the first rule and the second rule is ignored. As a result, none of the matching links are ever passed to the parse_item callback, which means no output file. On top of that, because of redirects on the site, following links with the first rule results in a seemingly infinite loop.

Some investigation is required to fix the redirect issue (and to modify the first rule so that it doesn't clash with the second), but commenting out the first rule entirely will produce an output file of links like so:

http://www.hsleiden.nl/activiteitenkalenderhttp://www.hsleiden.nlhttp://www.hsleiden.nl/vind-je-studie/proefstuderenhttp://www.hsleiden.nl/studiumgenerale

etc

They were all munged together on a single line, so you might want to add a newline character or separator each time you write to the output file.
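As a rough sketch of the newline fix: the helper below appends one URL per line. The name append_url is made up for illustration and is not part of Scrapy. It opens the file in text mode, so it assumes Python 3 where response.url is a str, and it uses a with block so the file is closed after each write.

```python
def append_url(filename, url):
    """Append a single URL to the file, followed by a newline."""
    # "a" appends in text mode; the with block closes the file afterwards.
    with open(filename, "a", encoding="utf-8") as f:
        f.write(url + "\n")
```

Inside parse_item this would be called as append_url("hsleiden-output.txt", response.url) in place of the bare open(...).write(...) line.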

Talvalin answered Nov 15 '22 04:11
