I am using scrapy to crawl a site which seems to be appending random values to the query string at the end of each URL. This is turning the crawl into a sort of an infinite loop.
How do i make scrapy to neglect the query string part of the URL's?
One way to remove query parameters from pages is through the View Settings. Under Admin > View Settings > Exclude Query Parameters, list the query parameters that you want to exclude from your page paths.
A query string is the portion of a URL where data is passed to a web application and/or back-end database. The reason we need query strings is that the HTTP protocol is stateless by design. For a website to be anything more than a brochure, you need to maintain state (store data).
See urllib.urlparse
Example code:
from urlparse import urlparse
o = urlparse('http://url.something.com/bla.html?querystring=stuff')
url_without_query_string = o.scheme + "://" + o.netloc + o.path
Example output:
Python 2.6.1 (r261:67515, Jun 24 2010, 21:47:49)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from urlparse import urlparse
>>> o = urlparse('http://url.something.com/bla.html?querystring=stuff')
>>> url_without_query_string = o.scheme + "://" + o.netloc + o.path
>>> print url_without_query_string
http://url.something.com/bla.html
>>>
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With