Scrapy Extract method yields a Cannot mix str and non-str arguments error

Question

I am learning Scrapy right now and am building a simple scraper for a real estate site. With this code I am trying to scrape the URLs of all the real estate listings for a specific city. I have run into the following error with my code: "Cannot mix str and non-str arguments".

I believe I have isolated my problem to the following part of my code:

props = response.xpath('//div[@class = "address ellipsis"]/a/@href').extract()

If I use the extract_first() function instead of the extract function in the props xpath assignment, the code kind of works. It grabs the first link for the property on each page. However, this ultimately is not what I want. I believe I have the xpath call correct as the code runs if I use the extract_first() method.

Can someone explain what I am doing wrong here? I have listed my full code below

import scrapy
from scrapy.http import Request

class AdvancedSpider(scrapy.Spider):
    name = 'advanced'
    allowed_domains = ['www.realtor.com']
    start_urls = ['http://www.realtor.com/realestateandhomes-search/Houston_TX/']

    def parse(self, response):
        props = response.xpath('//div[@class = "address ellipsis"]/a/@href').extract()

        for prop in props:
            absolute_url = response.urljoin(props)
            yield Request(absolute_url, callback=self.parse_props)

        next_page_url = response.xpath('//a[@class = "next"]/@href').extract_first()
        absolute_next_page_url = response.urljoin(next_page_url)
        yield scrapy.Request(absolute_next_page_url)

    def parse_props(self, response):
        pass

Please let me know if I can clarify anything.

alecxe · Accepted Answer

You are passing props, the whole list of strings, to response.urljoin() when you meant to pass the loop variable prop:

for prop in props:
    absolute_url = response.urljoin(prop)
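
To see where the message comes from, here is a minimal sketch that uses only the standard library's urllib.parse.urljoin, which is, as far as I know, what response.urljoin() ends up calling. The two href values are made-up placeholders, not real listing paths. It also suggests why extract_first() "kind of works": props is then a single string, the loop walks over its characters, and every iteration joins the same full string, so you get one URL repeated.

```python
from urllib.parse import urljoin

base = "http://www.realtor.com/realestateandhomes-search/Houston_TX/"
# Hypothetical hrefs standing in for what .extract() would return.
hrefs = ["/realestateandhomes-detail/example-1", "/realestateandhomes-detail/example-2"]

# Passing the whole list mixes a str base with a non-str argument.
try:
    urljoin(base, hrefs)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Passing each string individually works as intended.
absolute_urls = [urljoin(base, href) for href in hrefs]

# With extract_first(), props is one string: the loop iterates over its
# characters but still joins the full string each time.
first_href = hrefs[0]
urls_from_extract_first = {urljoin(base, first_href) for _ in first_href}

print(error_message)
print(absolute_urls)
print(urls_from_extract_first)
```

In the spider itself, duplicate requests for the same URL are normally dropped by Scrapy's default duplicate filter, which would match seeing only the first link per page.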

Erick Guerra · Answer

alecxe is right: it was a simple oversight in the name of the loop variable. You can also use the following notation:

for prop in response.xpath('//div[@class = "address ellipsis"]/a/@href').extract():
    yield scrapy.Request(response.urljoin(prop), callback=self.parse_props)

It's cleaner, and you're not creating the "absolute_url" variable on each iteration. On a larger scale, that may save a little memory.

Tags: python-3.x, web-scraping, scrapy

Asked by Josiah Hulsey
