
Rewrite scrapy URLs before sending the request

Tags:

python

scrapy

I'm using scrapy to crawl a multilingual site. For each object, versions in three different languages exist. I'm using the search as a starting point. Unfortunately the search contains URLs in various languages, which causes problems when parsing.

Therefore I'd like to preprocess the URLs before they get sent out. If they contain a specific string, I want to replace that part of the URL.

My spider extends CrawlSpider. I looked at the docs and found the make_requests_from_url(url) method, which led to this attempt:

def make_requests_from_url(self, url):
    """
    Override the original function to make sure only German URLs are
    used. If French or Italian URLs are detected, they're rewritten.
    """
    if '/f/suche' in url:
        self.log('French URL was rewritten: %s' % url)
        url = url.replace('/f/suche/pages/', '/d/suche/seiten/')
    elif '/i/suche' in url:
        self.log('Italian URL was rewritten: %s' % url)
        url = url.replace('/i/suche/pagine/', '/d/suche/seiten/')
    return super(MyMultilingualSpider, self).make_requests_from_url(url)

But that does not work for some reason. What would be the best way to rewrite URLs before requesting them? Maybe via a rule callback?

Asked Dec 15 '22 by Danilo Bargen

1 Answer

Probably worth noting an example, since it took me about 30 minutes to figure it out:

rules = [
    Rule(SgmlLinkExtractor(allow=(all_subdomains,)),
         callback='parse_item', process_links='process_links')
]

def process_links(self, links):
    for link in links:
        link.url = "something_to_prepend%ssomething_to_append" % link.url
    return links
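Applied to the question's language-rewriting problem, the process_links hook could look like the sketch below. It is a minimal, self-contained illustration: the rewrite_url helper is hypothetical (not part of Scrapy), the path fragments come straight from the question, and a plain function stands in for the spider method so the logic can run without Scrapy installed. In process_links, each item is a scrapy Link object whose url attribute is mutated before Scrapy builds the request.

```python
def rewrite_url(url):
    """Rewrite French/Italian search URLs to their German equivalents.

    Hypothetical helper; the path fragments are taken from the question.
    """
    if '/f/suche' in url:
        return url.replace('/f/suche/pages/', '/d/suche/seiten/')
    if '/i/suche' in url:
        return url.replace('/i/suche/pagine/', '/d/suche/seiten/')
    return url


def process_links(links):
    # In a CrawlSpider this would be a method taking (self, links);
    # each link's .url is rewritten before the request is created.
    for link in links:
        link.url = rewrite_url(link.url)
    return links
```

Note that SgmlLinkExtractor has since been removed from Scrapy; LinkExtractor from scrapy.linkextractors is the current equivalent, and make_requests_from_url was later deprecated in favour of start_requests, which makes the process_links approach the more durable one.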

Answered Dec 25 '22 by Tony