I'm writing some scraping code and running into the error shown in the traceback below. My code is as follows:
# -*- coding: utf-8 -*-
import scrapy

from myproject.items import Headline


class NewsSpider(scrapy.Spider):
    name = 'IC'
    allowed_domains = ['kosoku.jp']
    start_urls = ['http://kosoku.jp/ic.php']

    def parse(self, response):
        """
        extract target urls and combine them with the main domain
        """
        for url in response.css('table a::attr("href")'):
            yield(scrapy.Request(response.urljoin(url), self.parse_topics))

    def parse_topics(self, response):
        """
        pick up necessary information
        """
        item = Headline()
        item["name"] = response.css("h2#page-name ::text").re(r'.*(インターチェンジ)')
        item["road"] = response.css("div.ic-basic-info-left div:last-of-type ::text").re(r'.*道$')
        yield item
Each of these selectors returns the correct result when I try it individually in the Scrapy shell, but once they run inside the spider, it fails with this traceback:
2017-11-27 18:26:17 [scrapy.core.scraper] ERROR: Spider error processing <GET http://kosoku.jp/ic.php> (referer: None)
Traceback (most recent call last):
File "/Users/sonogi/envs/scrapy/lib/python3.5/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
yield next(it)
File "/Users/sonogi/envs/scrapy/lib/python3.5/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
for x in result:
File "/Users/sonogi/envs/scrapy/lib/python3.5/site-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
return (_set_referer(r) for r in result or ())
File "/Users/sonogi/envs/scrapy/lib/python3.5/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "/Users/sonogi/envs/scrapy/lib/python3.5/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "/Users/sonogi/scraping/myproject/myproject/spiders/IC.py", line 16, in parse
yield(scrapy.Request(response.urljoin(url), self.parse_topics))
File "/Users/sonogi/envs/scrapy/lib/python3.5/site-packages/scrapy/http/response/text.py", line 82, in urljoin
return urljoin(get_base_url(self), url)
File "/opt/local/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/parse.py", line 424, in urljoin
base, url, _coerce_result = _coerce_args(base, url)
File "/opt/local/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/parse.py", line 120, in _coerce_args
raise TypeError("Cannot mix str and non-str arguments")
TypeError: Cannot mix str and non-str arguments
2017-11-27 18:26:17 [scrapy.core.engine] INFO: Closing spider (finished)
I'm quite confused, and I'd appreciate anyone's help. Thanks in advance!
According to the Scrapy documentation, the .css(selector) method you're using returns a SelectorList instance. If you want the actual (unicode) string version of the URL, call the extract() method:
def parse(self, response):
    for url in response.css('table a::attr("href")').extract():
        yield scrapy.Request(response.urljoin(url), self.parse_topics)
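As a side note, if you're on Scrapy 1.4 or later, response.follow() resolves relative URLs against the current page for you, and it also accepts an attribute Selector directly, so a sketch of the same loop without the explicit extract()/urljoin() calls would be:

def parse(self, response):
    # response.follow() joins relative hrefs against the current page url,
    # and accepts an attribute Selector (e.g. from ::attr(href)) directly
    for link in response.css('table a::attr(href)'):
        yield response.follow(link, self.parse_topics)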
You're getting this error because of the yield scrapy.Request(...) line in parse. response.css('table a::attr("href")') returns a SelectorList rather than a list of strings, so you first have to extract each url as a str before you can pass it on to another function. Additionally, the usual attr() syntax has no quotation marks, so instead of a::attr("href") it should be a::attr(href). After fixing these two issues the code will look something like this:
def parse(self, response):
    """
    extract target urls and combine them with the main domain
    """
    # extract() converts the SelectorList into a list of str urls
    for url in response.css('table a::attr(href)').extract():
        yield response.follow(url, self.parse_topics)
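Putting the two fixes together, the whole spider would look something like this (a sketch reusing the Headline item and selectors from the question):

# -*- coding: utf-8 -*-
import scrapy

from myproject.items import Headline


class NewsSpider(scrapy.Spider):
    name = 'IC'
    allowed_domains = ['kosoku.jp']
    start_urls = ['http://kosoku.jp/ic.php']

    def parse(self, response):
        # extract() yields plain str urls, which response.follow() can resolve
        for url in response.css('table a::attr(href)').extract():
            yield response.follow(url, self.parse_topics)

    def parse_topics(self, response):
        item = Headline()
        item["name"] = response.css("h2#page-name ::text").re(r'.*(インターチェンジ)')
        item["road"] = response.css("div.ic-basic-info-left div:last-of-type ::text").re(r'.*道$')
        yield item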