Find and replace URLS in a block of text, return text + list of URLS

Question

I'm trying to find a way to take a block of text, replace all URLs in that text with some other text, then return the new text chunk and a list of URLs it found. Something like:

text = """This is some text www.google.com blah blah http://www.imgur.com/12345.jpg lol"""
text, urls = FindURLs(text, "{{URL}}")

Should give:

text = "This is some text {{URL}} blah blah {{URL}} lol"
urls = ["www.google.com", "http://www.imgur.com/12345.jpg"]

I know this will involve some regex - I've found some seemingly good URL detection regex here: http://www.regexguru.com/2008/11/detecting-urls-in-a-block-of-text/

I'm pretty rubbish with regex, though, so I'm finding that getting it to do what I want with python quite tricky. The order that the URLs are returned in doesn't really matter.

Thanks :)

obsoleter · Accepted Answer

The regular expression here should be liberal enough to catch urls without http or www.

Here's some simplistic python code that performs the text substitution and gives you a list of the results:

import re

url_regex = re.compile(r"""(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|$([^\s()<>\[\]]+|(\([^\s()<>\[\]]+$))*\))+(?:$([^\s()<>\[\]]+|(\([^\s()<>\[\]]+$))*\)|[^\s`!(){};:'".,<>?]))""")

text = "This is some text www.google.com blah blah http://www.imgur.com/12345.jpg lol"
matches = []

def process_match(m):
    matches.append(m.group(0))
    return '{{URL}}'

new_text = url_regex.sub(process_match, text)

print new_text
print matches

Find and replace URLS in a block of text, return text + list of URLS

Tags:

python

combatdave

1 Answers

obsoleter

Recent Activity

Donate For Us

Find and replace URLS in a block of text, return text + list of URLS

Tags:

python

combatdave

1 Answers

obsoleter

Related questions

Recent Activity

Donate For Us