My spider works, but I can't save the body of the website I crawl into a .html file. If I write self.html_file.write('test') it works fine. I don't know how to convert the bytes object to a string.
I use Python 3.6
Spider:
class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ['google.com']
    start_urls = ['http://google.com/']

    def __init__(self):
        self.path_to_html = html_path + 'index.html'
        self.path_to_header = header_path + 'index.html'
        self.html_file = open(self.path_to_html, 'w')

    def parse(self, response):
        url = response.url
        self.html_file.write(response.body)
        self.html_file.close()
        yield {
            'url': url
        }
Traceback:
Traceback (most recent call last):
  File "c:\python\python36-32\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "c:\Users\kv\AtomProjects\example_project\example_bot\example_bot\spiders\example.py", line 35, in parse
    self.html_file.write(response.body)
TypeError: write() argument must be str, not bytes
The actual problem is that you are getting bytes. You need to convert them to a string, and there are several ways to do that. You can use
self.html_file.write(response.body.decode("utf-8"))
instead of
self.html_file.write(response.body)
You can also use
self.html_file.write(response.text)
The correct way is to use response.text, not response.body.decode("utf-8"). To quote the documentation:
Keep in mind that Response.body is always a bytes object. If you want the unicode version use TextResponse.text (only available in TextResponse and subclasses).
and
text: Response body, as unicode.
The same as response.body.decode(response.encoding), but the result is cached after the first call, so you can access response.text multiple times without extra overhead.
Note: unicode(response.body) is not a correct way to convert the response body to unicode: you would be using the system default encoding (typically ascii) instead of the response encoding.
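To make that last point concrete, here is a small standalone sketch (not Scrapy-specific; the byte string is made up for illustration) of how decoding the same bytes with the wrong encoding either fails or mangles the text:

```python
# Pretend the server sent the text "café" encoded as Latin-1.
body = "café".encode("latin-1")  # b'caf\xe9'

# Decoding with the response's actual encoding recovers the text:
print(body.decode("latin-1"))  # café

# Decoding with a wrong default (e.g. ascii) raises an error instead:
try:
    body.decode("ascii")
except UnicodeDecodeError as err:
    print("ascii decode failed:", err)
```

This is exactly what response.text handles for you: it looks up response.encoding rather than relying on any default.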
Taking the answers above into consideration, and making the code as Pythonic as possible by adding a with statement, the example should be rewritten as:
class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ['google.com']
    start_urls = ['http://google.com/']

    def __init__(self):
        self.path_to_html = html_path + 'index.html'
        self.path_to_header = header_path + 'index.html'

    def parse(self, response):
        with open(self.path_to_html, 'w') as html_file:
            html_file.write(response.text)
        yield {
            'url': response.url
        }
Note that html_file will only be accessible from within the parse method.
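If you only want to save the page exactly as received and don't need a string at all, another option is to skip decoding entirely and open the file in binary mode, since response.body is already bytes. A minimal standalone sketch (using a temporary file and a made-up byte string in place of a real response):

```python
import os
import tempfile

# Stand-in for response.body, which Scrapy returns as bytes.
body = b"<html><body>test</body></html>"

path = os.path.join(tempfile.gettempdir(), "index.html")
with open(path, "wb") as html_file:  # 'wb' accepts bytes directly
    html_file.write(body)

# Reading it back in binary mode returns the identical bytes.
with open(path, "rb") as f:
    print(f.read() == body)  # True
```

This avoids decoding errors altogether and preserves the page byte-for-byte, at the cost of not having a str to work with afterwards.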