I'm relatively new to Scrapy, Python and object-oriented programming so apologies if I get any terminology incorrect or am unclear in any way.
I'm trying to write a spider that, as it scrapes items from a response, will also create a modified version of the response to save to file. For example, I'm trying to alter 'src' links to point to scraped files saved locally.
Currently, I'm scraping data using Scrapy's selectors and modifying the response using lxml. However, I want to use Scrapy's methods to do the modifications instead of lxml, since using both Scrapy selectors and lxml essentially means duplicating the code that locates the same elements in a response.
I've added some code below to illustrate my point. Everything occurs in the spider's parse function.
def parse(self, response):
    # Scrape thumbnail URLs using Scrapy selectors
    for post in response.css('.post'):  # For each post
        for thumb in post.css('.thumb'):  # For each thumbnail
            item = Item()  # Create an image item
            item['thumbnail_url'] = []
            item['thumbnail_savepath'] = []
            for x in thumb.xpath('img/@src').extract():
                thumbnail_url = 'https:' + x
                thumbnail_filename = re.search('.*/(.*)', thumbnail_url).group(1)
                thumbnail_savepath = 'thumbnails/' + thumbnail_filename
                item['thumbnail_url'] += [thumbnail_url]
                item['thumbnail_savepath'] += [thumbnail_savepath]
    # Make modified HTML using lxml
    body_lxml = lxml.html.document_fromstring(response.body)
    for thumbnail in body_lxml.xpath('//img'):
        thumbnail_src = thumbnail.get('src')  # Original link address
        thumbnail_path = './thumbnails/' + basename(thumbnail_src)  # New link address
        thumbnail.set('src', thumbnail_path)  # Setting the new link address
As the code shows, it iterates through the images to scrape the items using Scrapy selectors, then iterates a second time using lxml for modifying the response. I have to use two different methods to loop over the same elements, which I'm trying to avoid. I'd like to do the scraping and modification in the same for loop if possible.
I was thinking it might be possible to use the response.request() method, but I'm struggling to understand from the documentation and online searches how to use it. Is there some method that allows Scrapy to modify individual elements or fields in a response? Any help would be appreciated.
Thanks.
Parsel Selectors (which Scrapy uses underneath) are designed to extract information, not to edit the underlying HTML. I believe your current approach is the best possible approach.
If you really want to avoid the sense of duplication you could use lxml only, but I strongly suggest you don’t do that.
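If you do go that route, a minimal sketch of what the lxml-only version could look like: one loop both collects the item data and rewrites the src attribute in the same pass, so nothing is located twice. The '.post'/'.thumb' structure and save paths come from the question; the sample HTML and the dict-based items here are illustrative stand-ins (a real spider would parse response.body and build Item objects instead).

```python
from os.path import basename

import lxml.html

# Illustrative stand-in for response.body
html = """
<div class="post">
  <div class="thumb"><img src="//example.com/img/a.png"></div>
  <div class="thumb"><img src="//example.com/img/b.png"></div>
</div>
"""

doc = lxml.html.document_fromstring(html)

items = []
for img in doc.xpath('//div[@class="post"]//div[@class="thumb"]//img'):
    src = img.get('src')
    thumbnail_url = 'https:' + src
    thumbnail_savepath = 'thumbnails/' + basename(thumbnail_url)
    # Collect the item data...
    items.append({'thumbnail_url': thumbnail_url,
                  'thumbnail_savepath': thumbnail_savepath})
    # ...and rewrite the link in the same pass
    img.set('src', './' + thumbnail_savepath)

# Serialize the modified tree for saving to file
modified_html = lxml.html.tostring(doc, encoding='unicode')
```

The trade-off is giving up Scrapy's selector conveniences (CSS shorthand, `.extract()`), which is why the duplicated-loop approach you already have is a reasonable compromise.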