I'm relatively new to Scrapy, Python and object-oriented programming so apologies if I get any terminology incorrect or am unclear in any way.
I'm trying to write a spider that, as it scrapes items from a response, will also create a modified version of the response to save to file. For example, I'm trying to alter 'src' links to point to scraped files saved locally.
Currently, I'm scraping data using Scrapy's selectors and modifying the response using lxml. However, I want to use Scrapy's methods to do the modifications instead of lxml, since using both Scrapy selectors and lxml essentially means duplicating the code that locates the same elements in a response.
I've added some code below to illustrate my point. Everything occurs in the spider's parse function.
def parse(self, response):
    # Scrape thumbnail URLs using Scrapy selectors
    for post in response.css('.post'):  # For each post
        for thumb in post.css('.thumb'):  # For each thumbnail
            item = Item()  # Create an image item
            item['thumbnail_url'] = []
            item['thumbnail_savepath'] = []
            for x in thumb.xpath('img/@src').extract():
                thumbnail_url = 'https:' + x
                thumbnail_filename = re.search('.*/(.*)', thumbnail_url).group(1)
                thumbnail_savepath = 'thumbnails/' + thumbnail_filename
                item['thumbnail_url'] += [thumbnail_url]
                item['thumbnail_savepath'] += [thumbnail_savepath]
    # Make modified HTML using lxml
    body_lxml = lxml.html.document_fromstring(response.body)
    for thumbnail in body_lxml.xpath('//img'):
        thumbnail_src = thumbnail.get('src')  # Original link address
        thumbnail_path = './thumbnails/' + basename(thumbnail_src)  # New link address
        thumbnail.set('src', thumbnail_path)  # Setting the new link address
As the code shows, it iterates through the images to scrape the items using Scrapy selectors, then iterates a second time using lxml for modifying the response. I have to use two different methods to loop over the same elements, which I'm trying to avoid. I'd like to do the scraping and modification in the same for loop if possible.
I was thinking it might be possible to use the response.request() method, but I'm struggling to understand from the documentation and online searches how to use it. Is there some method that allows Scrapy to modify individual elements or fields in a response? Any help would be appreciated.
Thanks.
Parsel Selectors (which Scrapy uses underneath) are designed to extract information, not to edit the underlying HTML. I believe your current approach is the best possible approach.
If you really want to avoid the sense of duplication you could use lxml only, but I strongly suggest you don’t do that.
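If you do go that route, a minimal sketch of what the lxml-only version could look like: one loop both collects the item data and rewrites the src attribute in the same pass, so nothing is located twice. The '.post'/'.thumb' structure and save paths come from the question; the sample HTML and the dict-based items here are illustrative stand-ins (a real spider would parse response.body and build Item objects instead).

```python
from os.path import basename

import lxml.html

# Illustrative stand-in for response.body
html = """
<div class="post">
  <div class="thumb"><img src="//example.com/img/a.png"></div>
  <div class="thumb"><img src="//example.com/img/b.png"></div>
</div>
"""

doc = lxml.html.document_fromstring(html)

items = []
for img in doc.xpath('//div[@class="post"]//div[@class="thumb"]//img'):
    src = img.get('src')
    thumbnail_url = 'https:' + src
    thumbnail_savepath = 'thumbnails/' + basename(thumbnail_url)
    # Collect the item data...
    items.append({'thumbnail_url': thumbnail_url,
                  'thumbnail_savepath': thumbnail_savepath})
    # ...and rewrite the link in the same pass
    img.set('src', './' + thumbnail_savepath)

# Serialize the modified tree for saving to file
modified_html = lxml.html.tostring(doc, encoding='unicode')
```

The trade-off is giving up Scrapy's selector conveniences (CSS shorthand, `.extract()`), which is why the duplicated-loop approach you already have is a reasonable compromise.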