Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Detect URLs in a string and wrap with "<a href..." tag

Tags:

python

html

regex

I am looking to write something that seems like it should be easy enough, but for whatever reason I'm having a tough time getting my head around it.

I am looking to write a python function that, when passed a string, will pass that string back with HTML encoding around URLs.

unencoded_string = "This is a link - http://google.com"

def encode_string_with_links(unencoded_string):
    # some sort of regex magic occurs
    return encoded_string

print encoded_string

'This is a link - <a href="http://google.com">http://google.com</a>'

Thank you!

like image 275
Patrick Harrington Avatar asked Jul 01 '09 20:07

Patrick Harrington


2 Answers

Googled solutions:

#---------- find_urls.py----------#
# Functions to identify and extract URLs and email addresses

import re

def fix_urls(text):
    pat_url = re.compile(  r'''
                     (?x)( # verbose identify URLs within text
         (http|ftp|gopher) # make sure we find a resource type
                       :// # ...needs to be followed by colon-slash-slash
            (\w+[:.]?){2,} # at least two domain groups, e.g. (gnosis.)(cx)
                      (/?| # could be just the domain name (maybe w/ slash)
                [^ \n\r"]+ # or stuff then space, newline, tab, quote
                    [\w/]) # resource name ends in alphanumeric or slash
         (?=[\s\.,>)'"\]]) # assert: followed by white or clause ending
                         ) # end of match group
                           ''')
    pat_email = re.compile(r'''
                    (?xm)  # verbose identify URLs in text (and multiline)
                 (?=^.{11} # Mail header matcher
         (?<!Message-ID:|  # rule out Message-ID's as best possible
             In-Reply-To)) # ...and also In-Reply-To
                    (.*?)( # must grab to email to allow prior lookbehind
        ([A-Za-z0-9-]+\.)? # maybe an initial part: [email protected]
             [A-Za-z0-9-]+ # definitely some local user: [email protected]
                         @ # ...needs an at sign in the middle
              (\w+\.?){2,} # at least two domain groups, e.g. (gnosis.)(cx)
         (?=[\s\.,>)'"\]]) # assert: followed by white or clause ending
                         ) # end of match group
                           ''')

    for url in re.findall(pat_url, text):
       text = text.replace(url[0], '<a href="%(url)s">%(url)s</a>' % {"url" : url[0]})

    for email in re.findall(pat_email, text):
       text = text.replace(email[1], '<a href="mailto:%(email)s">%(email)s</a>' % {"email" : email[1]})

    return text

if __name__ == '__main__':
    print fix_urls("test http://google.com asdasdasd some more text")

EDIT: Adjusted to your needs

like image 53
tefozi Avatar answered Oct 13 '22 04:10

tefozi


The "regex magic" you need is just sub (which does a substitution):

def encode_string_with_links(unencoded_string):
  return URL_REGEX.sub(r'<a href="\1">\1</a>', unencoded_string)

URL_REGEX could be something like:

URL_REGEX = re.compile(r'''((?:mailto:|ftp://|http://)[^ <>'"{}|\\^`[\]]*)''')

This is a pretty loose regex for URLs: it allows mailto, http and ftp schemes, and after that pretty much just keeps going until it runs into an "unsafe" character (except percent, which you want to allow for escapes). You could make it more strict if you need to. For example, you could require that percents are followed by a valid hex escape, or only allow one pound sign (for the fragment) or enforce the order between query parameters and fragments. This should be enough to get you started, though.

like image 25
Laurence Gonsalves Avatar answered Oct 13 '22 03:10

Laurence Gonsalves