My spider works, but I can't save the body of the website I crawl into a .html file. If I write self.html_file.write('test') it works fine. I don't know how to convert the bytes object to a string.
I use Python 3.6
Spider:
class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ['google.com']
    start_urls = ['http://google.com/']

    def __init__(self):
        self.path_to_html = html_path + 'index.html'
        self.path_to_header = header_path + 'index.html'
        self.html_file = open(self.path_to_html, 'w')

    def parse(self, response):
        url = response.url
        self.html_file.write(response.body)
        self.html_file.close()
        yield {
            'url': url
        }
Traceback:
Traceback (most recent call last):
  File "c:\python\python36-32\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "c:\Users\kv\AtomProjects\example_project\example_bot\example_bot\spiders\example.py", line 35, in parse
    self.html_file.write(response.body)
TypeError: write() argument must be str, not bytes
The actual problem is that you are getting bytes. You need to convert them to a string, and there are several ways to do that. You can use
self.html_file.write(response.body.decode("utf-8"))
instead of
self.html_file.write(response.body)
You can also use
self.html_file.write(response.text)
The correct way is to use response.text, not response.body.decode("utf-8"). To quote the documentation:
Keep in mind that Response.body is always a bytes object. If you want the unicode version use TextResponse.text (only available in TextResponse and subclasses).
and
text: Response body, as unicode.
The same as response.body.decode(response.encoding), but the result is cached after the first call, so you can access response.text multiple times without extra overhead.
Note: unicode(response.body) is not a correct way to convert the response body to unicode: you would be using the system default encoding (typically ascii) instead of the response encoding.
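To make that last point concrete, here is a small standalone sketch (not Scrapy-specific; the byte string is made up for illustration) of how decoding the same bytes with the wrong encoding either fails or mangles the text:

```python
# Pretend the server sent the text "café" encoded as Latin-1.
body = "café".encode("latin-1")  # b'caf\xe9'

# Decoding with the response's actual encoding recovers the text:
print(body.decode("latin-1"))  # café

# Decoding with a wrong default (e.g. ascii) raises an error instead:
try:
    body.decode("ascii")
except UnicodeDecodeError as err:
    print("ascii decode failed:", err)
```

This is exactly what response.text handles for you: it looks up response.encoding rather than relying on any default.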
Taking the answers above into consideration, and making the code as Pythonic as possible by adding a with statement, the example should be rewritten as:
class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ['google.com']
    start_urls = ['http://google.com/']

    def __init__(self):
        self.path_to_html = html_path + 'index.html'
        self.path_to_header = header_path + 'index.html'

    def parse(self, response):
        with open(self.path_to_html, 'w') as html_file:
            html_file.write(response.text)
        yield {
            'url': response.url
        }
Note that html_file will only be accessible from within the parse method.
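If you only want to save the page exactly as received and don't need a string at all, another option is to skip decoding entirely and open the file in binary mode, since response.body is already bytes. A minimal standalone sketch (using a temporary file and a made-up byte string in place of a real response):

```python
import os
import tempfile

# Stand-in for response.body, which Scrapy returns as bytes.
body = b"<html><body>test</body></html>"

path = os.path.join(tempfile.gettempdir(), "index.html")
with open(path, "wb") as html_file:  # 'wb' accepts bytes directly
    html_file.write(body)

# Reading it back in binary mode returns the identical bytes.
with open(path, "rb") as f:
    print(f.read() == body)  # True
```

This avoids decoding errors altogether and preserves the page byte-for-byte, at the cost of not having a str to work with afterwards.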