Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Function in Python to clean up and normalize a URL

Tags:

python

url

I am using URLs as a key so I need them to be consistent and clean. I need a python function that will take a URL and clean it up so that I can do a get from the DB. For example, it will take the following:

example.com
example.com/
http://example.com/
http://example.com
http://example.com?
http://example.com/?
http://example.com//

and output a clean consistent version:

http://example.com/

I looked through std libs and GitHub and couldn't find anything like this

Update

I couldn't find a Python library that implements everything discussed here and in the RFC:

http://en.wikipedia.org/wiki/URL_normalization

So I am writing one now. There is a lot more to this than I initially imagined.

like image 644
nikcub Avatar asked Mar 10 '11 16:03

nikcub


1 Answers

Take a look at urlparse.urlparse(). I've had good success with it.


note: This answer is from 2011 and is specific to Python2. In Python3 the urlparse module has been named to urllib.parse. The corresponding Python3 documentation for urllib.parse can be found here:

https://docs.python.org/3/library/urllib.parse.html

like image 197
samplebias Avatar answered Sep 22 '22 02:09

samplebias