Scrapy: Save response.body as html file?

My spider works, but I can't save the body of the page I crawl to an .html file. If I write self.html_file.write('test'), it works fine. I don't know how to convert the body to a string.

I use Python 3.6

Spider:

class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ['google.com']
    start_urls = ['http://google.com/']

    def __init__(self):
        self.path_to_html = html_path + 'index.html'
        self.path_to_header = header_path + 'index.html'
        self.html_file = open(self.path_to_html, 'w')

    def parse(self, response):
        url = response.url
        self.html_file.write(response.body)
        self.html_file.close()
        yield {
            'url': url
        }

Traceback:

Traceback (most recent call last):
  File "c:\python\python36-32\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "c:\Users\kv\AtomProjects\example_project\example_bot\example_bot\spiders\example.py", line 35, in parse
    self.html_file.write(response.body)
TypeError: write() argument must be str, not bytes

bonblow asked Sep 06 '17 05:09



3 Answers

The actual problem is that response.body is a bytes object, while a file opened in text mode ('w') only accepts str. You need to convert the bytes to a string, and there are several ways to do that. You can use

    self.html_file.write(response.body.decode("utf-8"))

instead of

    self.html_file.write(response.body)

You can also use

    self.html_file.write(response.text)
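
To see the difference outside of Scrapy, here is a minimal sketch. The body variable is a hypothetical stand-in for response.body, and the temporary path is only for illustration; neither comes from the question.

```python
import os
import tempfile

# Hypothetical stand-in for response.body, which Scrapy exposes as bytes.
body = "<html><body>caf\u00e9</body></html>".encode("utf-8")

# Illustrative output location (an assumption, not the asker's html_path).
path = os.path.join(tempfile.mkdtemp(), "index.html")

# Writing bytes to a text-mode file reproduces the error from the question.
try:
    with open(path, "w", encoding="utf-8") as html_file:
        html_file.write(body)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Decoding first yields a str, which a text-mode file accepts.
with open(path, "w", encoding="utf-8") as html_file:
    html_file.write(body.decode("utf-8"))

# Read the file back to confirm what was saved.
with open(path, encoding="utf-8") as html_file:
    saved = html_file.read()
```

Here the decode call works only because the stand-in bytes really are UTF-8; for a real page that is an assumption.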

Somil answered Nov 03 '22 22:11


The correct way is to use response.text, not response.body.decode("utf-8"). To quote the documentation:

Keep in mind that Response.body is always a bytes object. If you want the unicode version use TextResponse.text (only available in TextResponse and subclasses).

and

text: Response body, as unicode.

The same as response.body.decode(response.encoding), but the result is cached after the first call, so you can access response.text multiple times without extra overhead.

Note: unicode(response.body) is not a correct way to convert response body to unicode: you would be using the system default encoding (typically ascii) instead of the response encoding.
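
A small sketch of why a hard-coded "utf-8" can go wrong, assuming a hypothetical page served as ISO-8859-1. The declared_encoding variable stands in for response.encoding, and body stands in for response.body; both are illustrative, not taken from the question.

```python
# Hypothetical page encoded as ISO-8859-1 (Latin-1) rather than UTF-8.
declared_encoding = "iso-8859-1"  # stand-in for response.encoding
body = "<p>caf\u00e9</p>".encode(declared_encoding)  # stand-in for response.body

# Hard-coding UTF-8 fails on the Latin-1 byte for the accented letter.
try:
    body.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding with the page's own encoding recovers the text; per the quoted
# documentation, response.text is equivalent to this call.
text = body.decode(declared_encoding)
```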


nirvana-msu answered Nov 03 '22 22:11


Taking the answers above into consideration, and making the code more Pythonic by using the with statement, the example can be rewritten like this:

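import scrapy

# html_path and header_path are assumed to be defined elsewhere,
# as in the question.

class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ['google.com']
    start_urls = ['http://google.com/']

    def __init__(self, *args, **kwargs):
        # Let scrapy.Spider run its own initialization first.
        super().__init__(*args, **kwargs)
        self.path_to_html = html_path + 'index.html'
        self.path_to_header = header_path + 'index.html'

    def parse(self, response):
        # The with statement closes the file even if the write fails.
        with open(self.path_to_html, 'w') as html_file:
            # response.text is a str, so it can be written in text mode.
            html_file.write(response.text)
        yield {
            'url': response.url
        }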

Note that html_file is then only accessible inside the parse method.
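
A different option, not taken from the answers above, is to skip decoding and write the raw bytes in binary mode, which keeps the file byte-for-byte identical to what the server sent. A minimal sketch, where body is a hypothetical stand-in for response.body and the temporary path is only for illustration:

```python
import os
import tempfile

# Hypothetical stand-in for response.body (bytes in some page encoding).
body = "<p>caf\u00e9</p>".encode("iso-8859-1")

# Illustrative output location (an assumption, not the asker's html_path).
path = os.path.join(tempfile.mkdtemp(), "index.html")

# 'wb' opens the file in binary mode, so write() accepts bytes directly.
with open(path, "wb") as html_file:
    html_file.write(body)

# Read the bytes back to confirm nothing was re-encoded.
with open(path, "rb") as html_file:
    saved = html_file.read()
```

In a spider this would correspond to opening self.path_to_html with 'wb' and writing response.body; whether that or response.text is preferable depends on whether you need the original bytes or decoded text.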


Mariano Ruiz answered Nov 03 '22 20:11