I am using URLs as keys, so I need them to be consistent and clean. I need a Python function that takes a URL and cleans it up so that I can do a get from the DB. For example, it should take any of the following:
example.com
example.com/
http://example.com/
http://example.com
http://example.com?
http://example.com/?
http://example.com//
and output a clean consistent version:
http://example.com/
I looked through the standard library and GitHub and couldn't find anything like this.
Update
I couldn't find a Python library that implements everything discussed here and in the RFC:
http://en.wikipedia.org/wiki/URL_normalization
So I am writing one now. There is a lot more to this than I initially imagined.
Take a look at urlparse.urlparse(). I've had good success with it.
Note: This answer is from 2011 and is specific to Python 2. In Python 3 the urlparse module has been renamed to urllib.parse. The corresponding Python 3 documentation for urllib.parse can be found here:
https://docs.python.org/3/library/urllib.parse.html
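As a rough illustration of how urllib.parse could be used for the cases listed in the question, here is a minimal Python 3 sketch. The function name normalize_url and the default_scheme parameter are made up for this example, and it assumes that bare inputs such as example.com should be treated as http and that the URL carries no user:password@ part. It only covers the cases from the question, not the full set of normalizations described in the RFC.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url, default_scheme="http"):
    """Return a consistent form of url for use as a lookup key (sketch only)."""
    url = url.strip()
    # urlsplit only recognises the host when a scheme is present, so
    # prepend the assumed default scheme to bare inputs like "example.com".
    if "://" not in url:
        url = default_scheme + "://" + url
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    # Host names are case-insensitive. Assumption: no userinfo in the netloc,
    # since lowercasing would also alter credentials.
    netloc = parts.netloc.lower()
    # Collapse runs of slashes and make sure the path is never empty.
    path = re.sub(r"/{2,}", "/", parts.path) or "/"
    # An empty query (a bare "?") is not written back by urlunsplit;
    # the fragment is discarded because it is not sent to the server.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


if __name__ == "__main__":
    for raw in ["example.com", "example.com/", "http://example.com?",
                "http://example.com/?", "http://example.com//"]:
        print(raw, "->", normalize_url(raw))
```

Collapsing repeated slashes matches the output asked for in the question, but some servers treat // in a path as significant, so that step may need to be removed depending on the URLs being stored.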