Scrapy error: TypeError: __init__() got an unexpected keyword argument 'callback'

Tags: python, scrapy

I'm trying to scrape a website by extracting all links that contain "huis" ("house" in Dutch). Following http://doc.scrapy.org/en/latest/topics/spiders.html, I wrote this spider:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from Funda.items import FundaItem

class FundaSpider(scrapy.Spider):
    name = "Funda"
    allowed_domains = ["funda.nl"]
    start_urls = [
        "http://www.funda.nl/koop/amsterdam/"
    ]

    rules = (
    Rule(LinkExtractor(allow=r'.*huis.*', callback='parse_item'))
    )

    def parse_item(self, response):
        item = FundaItem()
        item['title'] = response.extract()
        return item

However, I'm getting this error message:

Rule(LinkExtractor(allow=r'.*huis.*', callback='parse_item'))
TypeError: __init__() got an unexpected keyword argument 'callback'

From a previous post (Scrapy Error: TypeError: __init__() got an unexpected keyword argument 'deny') it looks like a possible reason is mismatched brackets, such that the keyword is passed to Rule instead of LinkExtractor. It seems to me that in this case, however, callback is within the LinkExtractor bracket as intended.

Any ideas what is causing this error?

Kurt Peek asked Apr 16 '26 06:04



1 Answer

Yes, callback is definitely being passed to LinkExtractor, and that is the problem: callback is not among the parameters the documentation lists for that class.

The Rule class, on the other hand, does list a callback parameter in the documentation, so it should be passed to Rule instead of LinkExtractor:

Rule(LinkExtractor(allow=r'.*huis.*'), callback='parse_item')
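To see the difference without installing Scrapy, here is a minimal sketch using stand-in classes. The two classes below are hypothetical: they are not Scrapy's implementation and only mimic a subset of the keyword arguments that the documentation lists for LinkExtractor and Rule. The helper function names are mine as well.

```python
# Hypothetical stand-ins: they only mimic which keyword arguments the
# documented Scrapy classes accept. They are NOT the real implementation.
class LinkExtractor:
    def __init__(self, allow=(), deny=()):
        self.allow = allow
        self.deny = deny


class Rule:
    def __init__(self, link_extractor, callback=None, follow=None):
        self.link_extractor = link_extractor
        self.callback = callback
        self.follow = follow


def try_misplaced_callback():
    """Pass callback to LinkExtractor, as in the question, and return the error text."""
    try:
        Rule(LinkExtractor(allow=r'.*huis.*', callback='parse_item'))
    except TypeError as exc:
        return str(exc)
    return None


def build_correct_rule():
    """Close the LinkExtractor bracket first, then pass callback to Rule."""
    return Rule(LinkExtractor(allow=r'.*huis.*'), callback='parse_item')


if __name__ == "__main__":
    print(try_misplaced_callback())
    print(build_correct_rule().callback)
```

The first call fails in LinkExtractor's constructor before Rule is ever built, which is why the traceback complains about __init__() and 'callback' even though the line starts with Rule(.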

If you're thinking "but why did the answerer of the linked question put callback inside the LinkExtractor call?", I think you may be misinterpreting the nesting of the parentheses, which is admittedly somewhat confusing. Changing the layout makes it a little clearer:

rules = (
    Rule(
        LinkExtractor(
            allow=[r'/*'], 
            deny=('blogs/*', 'videos/*', )
        ),
        callback='parse_html'
    ), 
)
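Note the trailing comma after the Rule(...) in the layout above. The snippet in the question has no such comma, so its outer parentheses are only grouping and rules would be a single Rule rather than a tuple. As far as I can tell from the documentation, rules is meant to be a list (or other iterable) of Rule objects, so this is worth fixing too. A small sketch, again with a hypothetical stand-in class rather than the real scrapy.spiders.Rule:

```python
class Rule:
    """Hypothetical stand-in for scrapy.spiders.Rule; it only stores a callback name."""

    def __init__(self, callback=None):
        self.callback = callback


def is_iterable(obj):
    """Return True if iter(obj) succeeds, False if it raises TypeError."""
    try:
        iter(obj)
    except TypeError:
        return False
    return True


# Parentheses alone only group the expression: this is a single Rule.
without_comma = (Rule(callback='parse_item'))

# The trailing comma is what makes a one-element tuple.
with_comma = (Rule(callback='parse_item'),)

if __name__ == "__main__":
    print(type(without_comma).__name__, is_iterable(without_comma))
    print(type(with_comma).__name__, is_iterable(with_comma))
```

Separately, rules is only used by CrawlSpider, so the spider in the question would also need to subclass CrawlSpider (which it already imports) instead of scrapy.Spider for the rule to have any effect.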
Kevin answered Apr 17 '26 18:04



