I have URLs formatted as:
google.com
www.google.com
http://google.com
http://www.google.com
I would like to convert all of these types of links to a uniform format, starting with http://
http://google.com
How can I prepend URLs with http:// using Python?
Python does have built-in functions that handle this correctly, such as urlparse:
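import urlparse  # Python 2; on Python 3 use: import urllib.parse as urlparse
my_url = 'google.com'  # example input
# Parse with 'http' as the default scheme for URLs that have none
p = urlparse.urlparse(my_url, 'http')
# Without a scheme, the host ends up in .path instead of .netloc
netloc = p.netloc or p.path
path = p.path if p.netloc else ''
if not netloc.startswith('www.'):
    netloc = 'www.' + netloc
# Build a new result object with the adjusted scheme, netloc and path
p = urlparse.ParseResult('http', netloc, path, *p[3:])
print(p.geturl())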
If you want to remove (or add) the www part, you have to edit the .netloc field of the resulting object before calling .geturl().
Because ParseResult is a namedtuple, you cannot edit it in place; you have to create a new object.
PS: For Python 3, it should be urllib.parse.urlparse
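As a rough Python 3 sketch of the same idea, the namedtuple's `_replace` method can build the new object instead of calling the `ParseResult` constructor directly. The function name `normalize` and the `keep_www` parameter are my own, not part of the standard library, and this version strips the www prefix (as the question asks) and forces the scheme to http, which may not be what you want for https links.

```python
from urllib.parse import urlparse

def normalize(url, keep_www=False):
    """Return url in the form http://host[/path] (hypothetical helper)."""
    # 'http' is only a default; it is used when the URL has no scheme
    p = urlparse(url, 'http')
    if p.netloc:
        netloc, path = p.netloc, p.path
    else:
        # Without a scheme, urlparse puts 'host/rest' into .path
        netloc, _, rest = p.path.partition('/')
        path = '/' + rest if rest else ''
    if not keep_www and netloc.startswith('www.'):
        netloc = netloc[len('www.'):]
    # ParseResult is a namedtuple, so _replace returns a new object
    return p._replace(scheme='http', netloc=netloc, path=path).geturl()

for u in ['google.com', 'www.google.com',
          'http://google.com', 'http://www.google.com']:
    print(normalize(u))  # http://google.com each time
```

Inputs that include a port but no scheme (for example `google.com:80`) are not handled here, since how `urlparse` splits them has varied between Python versions.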
I found it easy to detect the protocol with regex and then append it if missing:
import re

def formaturl(url):
    if not re.match('(?:http|ftp|https)://', url):
        return 'http://{}'.format(url)
    return url

url = 'test.com'
print(formaturl(url))  # http://test.com
url = 'https://test.com'
print(formaturl(url))  # https://test.com
I hope it helps!
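The pattern above only recognizes three lowercase schemes. A possible variant, assuming you want to leave any URL that already has a scheme untouched, matches the generic scheme syntax (a letter followed by letters, digits, `+`, `.` or `-`) in either case. The name `ensure_scheme` and the `default` parameter are made up for this sketch.

```python
import re

# Generic "scheme://" prefix: a letter, then letters, digits, '+', '.' or '-'
SCHEME_RE = re.compile(r'[a-zA-Z][a-zA-Z0-9+.\-]*://')

def ensure_scheme(url, default='http'):
    """Prepend default:// only when url has no scheme (hypothetical helper)."""
    if SCHEME_RE.match(url):
        return url
    return '{}://{}'.format(default, url)

print(ensure_scheme('test.com'))          # http://test.com
print(ensure_scheme('HTTPS://test.com'))  # HTTPS://test.com
```

Unlike the urlparse approach, this leaves the www part and the existing scheme alone; it only fills in a missing scheme.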
For the formats that you mention in your question, you can do something as simple as:
def convert(url):
    if url.startswith('http://www.'):
        return 'http://' + url[len('http://www.'):]
    if url.startswith('www.'):
        return 'http://' + url[len('www.'):]
    if not url.startswith('http://'):
        return 'http://' + url
    return url
But please note that there are probably other formats that you are not anticipating. In addition, keep in mind that the output URL (according to your definitions) will not necessarily be a valid one (e.g., DNS may not be able to resolve its host name to an IP address).
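To make the caveat about unanticipated formats concrete, here is a small check that repeats the convert function and runs it on the four formats from the question, followed by two inputs of my own choosing (an https link and an uppercase scheme) that it was not written for.

```python
def convert(url):
    if url.startswith('http://www.'):
        return 'http://' + url[len('http://www.'):]
    if url.startswith('www.'):
        return 'http://' + url[len('www.'):]
    if not url.startswith('http://'):
        return 'http://' + url
    return url

# The four formats from the question all map to the same result
for u in ['google.com', 'www.google.com',
          'http://google.com', 'http://www.google.com']:
    print(convert(u))  # http://google.com

# Formats the function does not anticipate get a second scheme prepended
print(convert('https://www.google.com'))  # http://https://www.google.com
print(convert('HTTP://google.com'))       # http://HTTP://google.com
```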
If your URLs are strings, you could just concatenate:
one = "https://"
two = "www.privateproperty.co.za"
link = "".join((one, two))