I'm using scrapy to crawl a multilingual site. For each object, versions in three different languages exist. I'm using the search as a starting point. Unfortunately the search contains URLs in various languages, which causes problems when parsing.
Therefore I'd like to preprocess the URLs before they get sent out. If they contain a specific string, I want to replace that part of the URL.
My spider extends the CrawlSpider
. I looked at the docs and found the make_request_from _url(url)
method, which led to this attempt:
def make_requests_from_url(self, url):
"""
Override the original function go make sure only german URLs are
being used. If french or italian URLs are detected, they're
rewritten.
"""
if '/f/suche' in url:
self.log('French URL was rewritten: %s' % url)
url = url.replace('/f/suche/pages/', '/d/suche/seiten/')
elif '/i/suche' in url:
self.log('Italian URL was rewritten: %s' % url)
url = url.replace('/i/suche/pagine/', '/d/suche/seiten/')
return super(MyMultilingualSpider, self).make_requests_from_url(url)
But that does not work for some reason. What would be the best way to rewrite URLs before requesting them? Maybe via a rule callback?
Probably worth nothing an example since it took me about 30 minutes to figure it out:
rules = [
Rule(SgmlLinkExtractor(allow = (all_subdomains,)), callback='parse_item', process_links='process_links')
]
def process_links(self,links):
for link in links:
link.url = "something_to_prepend%ssomething_to_append" % link.url
return links
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With