Using Python I want to replace all URLs in a body of text with links to those URLs, like what Gmail does. Can this be done in a one liner regular expression?
Edit: by body of text I just meant plain text - no HTML
replace() The replace() method of the Location interface replaces the current resource with the one at the provided URL.
You can load the document up with a DOM/HTML parsing library ( see html5lib ), grab all text nodes, match them against a regular expression and replace the text nodes with a regex replacement of the URI with anchors around it using a PCRE such as:
/(https?:[;\/?\\@&=+$,\[\]A-Za-z0-9\-_\.\!\~\*\'\(\)%][\;\/\?\:\@\&\=\+\$\,\[\]A-Za-z0-9\-_\.\!\~\*\'\(\)%#]*|[KZ]:\\*.*\w+)/g
I'm quite sure you can scourge through and find some sort of utility that does this, I can't think of any off the top of my head though.
Edit: Try using the answers here: How do I get python-markdown to additionally "urlify" links when formatting plain text?
import re
urlfinder = re.compile("([0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}|((news|telnet|nttp|file|http|ftp|https)://)|(www|ftp)[-A-Za-z0-9]*\\.)[-A-Za-z0-9\\.]+):[0-9]*)?/[-A-Za-z0-9_\\$\\.\\+\\!\\*\\(\\),;:@&=\\?/~\\#\\%]*[^]'\\.}>\\),\\\"]")
def urlify2(value):
return urlfinder.sub(r'<a href="\1">\1</a>', value)
call urlify2 on a string and I think that's it if you aren't dealing with a DOM object.
I hunted around a lot, tried these solutions and was not happy with their readability or features, so I rolled the following:
_urlfinderregex = re.compile(r'http([^\.\s]+\.[^\.\s]*)+[^\.\s]{2,}')
def linkify(text, maxlinklength):
def replacewithlink(matchobj):
url = matchobj.group(0)
text = unicode(url)
if text.startswith('http://'):
text = text.replace('http://', '', 1)
elif text.startswith('https://'):
text = text.replace('https://', '', 1)
if text.startswith('www.'):
text = text.replace('www.', '', 1)
if len(text) > maxlinklength:
halflength = maxlinklength / 2
text = text[0:halflength] + '...' + text[len(text) - halflength:]
return '<a class="comurl" href="' + url + '" target="_blank" rel="nofollow">' + text + '<img class="imglink" src="/images/linkout.png"></a>'
if text != None and text != '':
return _urlfinderregex.sub(replacewithlink, text)
else:
return ''
You'll need to get a link out image, but that's pretty easy. This is specifically for user submitted text like comments which I assume is usually what people are dealing with.
/\w+:\/\/[^\s]+/
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With