
Scrapy Extract method yields a Cannot mix str and non-str arguments error

I am learning Scrapy and am building a simple scraper for a real estate site. With this code I am trying to scrape the URLs of all the real estate listings for a specific city. My code fails with the following error: "Cannot mix str and non-str arguments".

I believe I have isolated the problem to the following part of my code:

props = response.xpath('//div[@class = "address ellipsis"]/a/@href').extract()

If I use extract_first() instead of extract() in the props XPath assignment, the code partly works: it grabs the first property link on each page. That is not what I want, since I need every link, but because the code runs with extract_first() I believe the XPath expression itself is correct.

Can someone explain what I am doing wrong here? My full code is listed below.

import scrapy
from scrapy.http import Request

class AdvancedSpider(scrapy.Spider):
    name = 'advanced'
    allowed_domains = ['www.realtor.com']
    start_urls = ['http://www.realtor.com/realestateandhomes-search/Houston_TX/']

    def parse(self, response):
        props = response.xpath('//div[@class = "address ellipsis"]/a/@href').extract()

        for prop in props:
            absolute_url = response.urljoin(props)
            yield Request(absolute_url, callback=self.parse_props)

        next_page_url = response.xpath('//a[@class = "next"]/@href').extract_first()
        absolute_next_page_url = response.urljoin(next_page_url)
        yield scrapy.Request(absolute_next_page_url)

    def parse_props(self, response):
        pass

Please let me know if I can clarify anything.

Josiah Hulsey asked Jan 21 '26 23:01


2 Answers

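You are passing props, the whole list of strings, to response.urljoin(), but you meant to pass prop, the current item of the loop. That also explains why extract_first() appeared to work: it returns a single string, so props was a string rather than a list.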

for prop in props:
    absolute_url = response.urljoin(prop)
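
The error can be reproduced without Scrapy. As far as I know, response.urljoin() is a thin wrapper around the standard library's urllib.parse.urljoin, which raises this TypeError when a str base is combined with a non-str argument such as a list. The sketch below uses made-up listing paths (the example-1 and example-2 hrefs are hypothetical, not real realtor.com URLs), and join_all and join_error are illustrative helper names:

```python
from urllib.parse import urljoin

# Hypothetical stand-ins for response.url and the hrefs returned by .extract().
base_url = "http://www.realtor.com/realestateandhomes-search/Houston_TX/"
props = [
    "/realestateandhomes-detail/example-1",
    "/realestateandhomes-detail/example-2",
]


def join_all(base, hrefs):
    """Join each href against the base URL, one string at a time."""
    return [urljoin(base, href) for href in hrefs]


def join_error(base, hrefs):
    """Pass the whole list by mistake and return the error message, if any."""
    try:
        urljoin(base, hrefs)
    except TypeError as exc:
        return str(exc)
    return None


absolute_urls = join_all(base_url, props)
error_message = join_error(base_url, props)

print(absolute_urls)
print(error_message)
```

Joining each string separately yields absolute URLs, while passing the list reproduces the message from the question.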
alecxe answered Jan 27 '26 02:01


alecxe is right: it was a simple slip in the name of the loop variable. You can also write the loop more compactly:

for prop in response.xpath('//div[@class = "address ellipsis"]/a/@href').extract():
    yield scrapy.Request(response.urljoin(prop), callback=self.parse_props)

It's cleaner, and you avoid assigning the intermediate "absolute_url" variable on every iteration, although any memory saving from that is negligible in practice.

Erick Guerra answered Jan 27 '26 00:01


