I'm trying to find a way to take a block of text, replace all URLs in that text with some other text, then return the new text chunk and a list of URLs it found. Something like:
text = """This is some text www.google.com blah blah http://www.imgur.com/12345.jpg lol"""
text, urls = FindURLs(text, "{{URL}}")
Should give:
text = "This is some text {{URL}} blah blah {{URL}} lol"
urls = ["www.google.com", "http://www.imgur.com/12345.jpg"]
I know this will involve some regex - I've found some seemingly good URL detection regex here: http://www.regexguru.com/2008/11/detecting-urls-in-a-block-of-text/
I'm pretty rubbish with regex, though, so I'm finding that getting it to do what I want with python quite tricky. The order that the URLs are returned in doesn't really matter.
Thanks :)
The regular expression here should be liberal enough to catch urls without http or www.
Here's some simplistic python code that performs the text substitution and gives you a list of the results:
import re
url_regex = re.compile(r"""(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>\[\]]+|\(([^\s()<>\[\]]+|(\([^\s()<>\[\]]+\)))*\))+(?:\(([^\s()<>\[\]]+|(\([^\s()<>\[\]]+\)))*\)|[^\s`!(){};:'".,<>?\[\]]))""")
text = "This is some text www.google.com blah blah http://www.imgur.com/12345.jpg lol"
matches = []
def process_match(m):
matches.append(m.group(0))
return '{{URL}}'
new_text = url_regex.sub(process_match, text)
print new_text
print matches
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With