How do I remove the query string from a URL?

I am using Scrapy to crawl a site that seems to append random values to the query string at the end of each URL. Because every link therefore looks new, the crawl turns into what is effectively an infinite loop.

How do I make Scrapy ignore the query string part of the URLs?

Sanket Gupta asked Dec 19 '11 20:12



1 Answer

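See the urlparse function. In Python 2 it lives in the urlparse module (as used here); in Python 3 it lives in urllib.parse. Parse the URL, then rebuild it from the scheme, host, and path only, leaving out the query string.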

Example code:

from urlparse import urlparse
o = urlparse('http://url.something.com/bla.html?querystring=stuff')

url_without_query_string = o.scheme + "://" + o.netloc + o.path

Example output:

Python 2.6.1 (r261:67515, Jun 24 2010, 21:47:49) 
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from urlparse import urlparse
>>> o = urlparse('http://url.something.com/bla.html?querystring=stuff')
>>> url_without_query_string = o.scheme + "://" + o.netloc + o.path
>>> print url_without_query_string
http://url.something.com/bla.html
>>> 
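The session above is Python 2. As a rough Python 3 equivalent, the same idea can be sketched with urllib.parse. The helper names strip_query and unique_pages are hypothetical, chosen only for illustration, and the duplicate filter is a plain set rather than anything Scrapy-specific:

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(url):
    """Return url without its query string and fragment (hypothetical helper)."""
    parts = urlsplit(url)
    # Keep scheme, host and path; blank out the query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def unique_pages(urls):
    """Yield each page once, comparing URLs with the query string removed."""
    seen = set()
    for url in urls:
        cleaned = strip_query(url)
        if cleaned not in seen:
            seen.add(cleaned)
            yield cleaned


if __name__ == "__main__":
    print(strip_query("http://url.something.com/bla.html?querystring=stuff"))
```

In a Scrapy spider, a function like strip_query could probably be handed to a link extractor through its process_value argument so that links are normalized before requests are scheduled. That wiring is an assumption about your setup and is not shown or tested here. Note also that dropping the query string is only safe when the site does not use it to select distinct pages.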
Sjaak Trekhaak answered Sep 25 '22 08:09