
Scrapy: Modifying elements and fields in a response

I'm relatively new to Scrapy, Python and object-oriented programming so apologies if I get any terminology incorrect or am unclear in any way.

I'm trying to write a spider that, as it scrapes items from a response, will also create a modified version of the response to save to file. For example, I'm trying to alter 'src' links to point to scraped files saved locally.

Currently, I'm scraping data using Scrapy's selectors and modifying the response using lxml. However, I want to use Scrapy's methods to do the modifications instead of lxml as using both Scrapy selectors and lxml means essentially doubling code to locate the same elements in a response.

I've added some code below to illustrate my point. Everything occurs in the spider parse function.

import re
from os.path import basename

import lxml.html

def parse(self, response):

    # Scrape thumbnail URLs using Scrapy selectors
    for post in response.css('.post'): # For each post
        for thumb in post.css('.thumb'): # For each thumbnail
            item = Item() # Create an image item
            item['thumbnail_url'] = []
            item['thumbnail_savepath'] = []
            for x in thumb.xpath('img/@src').extract():
                thumbnail_url = 'https:' + x
                thumbnail_filename = re.search('.*/(.*)', thumbnail_url).group(1)
                thumbnail_savepath = 'thumbnails/' + thumbnail_filename
                item['thumbnail_url'] += [thumbnail_url]
                item['thumbnail_savepath'] += [thumbnail_savepath]
            yield item # Emit the item once its thumbnails are collected

    # Make modified html using lxml
    body_lxml = lxml.html.document_fromstring(response.body)
    for thumbnail in body_lxml.xpath('//img'):
        thumbnail_src = thumbnail.get('src') # Original link address
        thumbnail_path = './thumbnails/' + basename(thumbnail_src) # New link address
        thumbnail.set('src', thumbnail_path) # Setting new link address

As the code shows, it iterates through the images to scrape the items using Scrapy selectors, then iterates a second time using lxml for modifying the response. I have to use two different methods to loop over the same elements, which I'm trying to avoid. I'd like to do the scraping and modification in the same for loop if possible.

I was thinking it was possible to use the response.request() method but am struggling to understand how to use this from the documentation and searches online. Is there some method that allows Scrapy to modify individual elements or fields in a response? Any help would be appreciated.

Thanks.

Marcus asked Jul 19 '15 14:07




1 Answer

Currently, I'm scraping data using Scrapy's selectors and modifying the response using lxml. However, I want to use Scrapy's methods to do the modifications instead of lxml as using both Scrapy selectors and lxml means essentially doubling code to locate the same elements in a response.

Parsel Selectors (which Scrapy uses underneath) are designed to extract information, not to edit the underlying HTML. I believe your current approach is the best possible approach.

If you really want to avoid the sense of duplication you could use lxml only, but I strongly suggest you don’t do that.
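For reference only, here is a minimal sketch of what the lxml-only option could look like, with extraction and modification done in the same loop. The function name `rewrite_thumbnails`, the `local_dir` parameter, and the record keys are illustrative assumptions, not part of Scrapy or lxml. The XPath is a rough stand-in for the `.thumb` CSS selector in the question.

```python
from posixpath import basename

import lxml.html


def rewrite_thumbnails(html, local_dir="thumbnails"):
    """Collect thumbnail URLs and repoint each <img> at a local copy in one pass.

    Returns (records, modified_html). All names here are hypothetical.
    """
    doc = lxml.html.document_fromstring(html)
    records = []
    # Rough XPath equivalent of the CSS selector ".thumb > img[src]".
    thumbs = doc.xpath(
        '//*[contains(concat(" ", normalize-space(@class), " "), " thumb ")]'
        '/img[@src]'
    )
    for img in thumbs:
        src = img.get("src")
        # Protocol-relative URLs ("//host/path") need a scheme to be fetchable.
        url = "https:" + src if src.startswith("//") else src
        savepath = local_dir + "/" + basename(src)
        records.append({"thumbnail_url": url, "thumbnail_savepath": savepath})
        # Edit the same element in place, so no second lookup is needed.
        img.set("src", "./" + savepath)
    # Serialize the modified tree back to an HTML string.
    return records, lxml.html.tostring(doc, encoding="unicode")
```

In a spider callback this could be called as `records, modified = rewrite_thumbnails(response.text)`, with each record then copied into an item. The trade-off is that you give up Scrapy's selector conveniences for this page, which is part of why the two-step approach above is still the one I would recommend.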

Gallaecio answered Oct 10 '22 09:10
