Python regex fails to identify Markdown links

I am trying to write a regex in Python to find URLs in a Markdown text string. Once a URL is found, I want to check whether it is wrapped in Markdown link syntax: [text](url). I am having a problem with the latter. I am using a regex, link_exp, to search, but the results are not what I expected and I cannot get my head around it.

This is probably something simple that I am not seeing.

Here is the code, followed by an explanation of the link_exp regex:

import re

text = '''
[Vocoder](http://en.wikipedia.org/wiki/Vocoder )
[Turing]( http://en.wikipedia.org/wiki/Alan_Turing)
[Autotune](http://en.wikipedia.org/wiki/Autotune)
http://en.wikipedia.org/wiki/The_Voder
'''

# find all urls
urls = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', text)
for url in urls:
    url = re.escape(url)
    # expression with url wrapped in link syntax
    link_exp = re.compile(r'\[.*\]\(\s*{0}\s*\)'.format(url))
    search = link_exp.search(text)
    if search is not None:
        print(url)

# The expression should translate to:
# \[  - literal [
# .*  - zero or more of any character
# \]  - literal ]
# \(  - literal (
# \s* - zero or more whitespace characters
# {0} - the url
# \s* - zero or more whitespace characters
# \)  - literal )
# NOTE: I am allowing whitespace to cover cases like [foo]( http://www.foo.sexy   )

The only output I get is:

http\:\/\/en\.wikipedia\.org\/wiki\/Vocoder

which means the expression only finds the link that has whitespace before the closing parenthesis. That is not all I want: it is only one case, and links without whitespace should be matched too.

Do you think you can help me with this one?
Cheers

MrCastro asked Feb 13 '23 02:02


1 Answer

The problem is the regex you use to pull out the URLs in the first place: it matches ) as part of the URL. Once that URL is substituted into link_exp, the pattern effectively requires the closing parenthesis twice. This happens for every link except the first one, where the space before the ) stops the URL match early.
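To see this concretely, here is a small check that reuses the question's sample text and URL pattern (only the raw-string prefix and the variable name url_pattern are my additions) and prints what the first regex actually captures:

```python
import re

text = '''
[Vocoder](http://en.wikipedia.org/wiki/Vocoder )
[Turing]( http://en.wikipedia.org/wiki/Alan_Turing)
[Autotune](http://en.wikipedia.org/wiki/Autotune)
http://en.wikipedia.org/wiki/The_Voder
'''

# The URL pattern from the question, unchanged apart from the raw-string prefix.
url_pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

urls = re.findall(url_pattern, text)
for url in urls:
    # repr() makes a trailing ")" or space easy to spot.
    print(repr(url))
```

The second and third captures end in ), so the later search for the URL followed by another ) cannot succeed for them.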

I'm not quite sure what each part of your URL regex is trying to do, but the portion [$-_@.&+] contains the range $-_, which runs from $ (ASCII 36) to _ (ASCII 95) and so covers a large number of characters you probably don't mean, including ). (The [!*\(\),] alternative also allows ) explicitly.)
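A quick way to convince yourself of what that range covers is to enumerate it. This is only an illustrative sketch; the names covered and paren_in_range are made up for the example:

```python
import re

# Every character between "$" and "_" inclusive, i.e. what $-_ means
# inside a character class.
covered = [chr(code) for code in range(ord('$'), ord('_') + 1)]

print(ord('$'), ord(')'), ord('_'))  # code points of the range ends and of ")"
print(''.join(covered))              # the full set of characters in the range

# ")" on its own is matched by the range.
paren_in_range = re.match(r'[$-_]', ')') is not None
print(paren_in_range)
```

The range also takes in digits, uppercase letters, and punctuation such as ( : ; < > and ?.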

Instead of looking for URLs and then checking whether they are inside a link, why not do both at once? This way your URL regex can be looser, because the surrounding link syntax makes it less likely to match anything else:

import re

# Anything that isn't a closing square bracket
name_regex = r"[^]]+"
# http:// or https:// followed by anything but a closing paren
url_regex = r"http[s]?://[^)]+"

# `text` is the sample string from the question
markup_regex = r'\[({0})]\(\s*({1})\s*\)'.format(name_regex, url_regex)

for match in re.findall(markup_regex, text):
    print(match)

Result:

('Vocoder', 'http://en.wikipedia.org/wiki/Vocoder ')
('Turing', 'http://en.wikipedia.org/wiki/Alan_Turing')
('Autotune', 'http://en.wikipedia.org/wiki/Autotune')

You could probably improve the URL regex if you need to be stricter.
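As one possible tightening (a sketch, not the only option), the URL part can also exclude whitespace, so the trailing space in the first result is no longer captured. The names strict_url_regex and strict_markup_regex are hypothetical, and the assumption that a URL never contains whitespace or ) is mine:

```python
import re

text = '''
[Vocoder](http://en.wikipedia.org/wiki/Vocoder )
[Turing]( http://en.wikipedia.org/wiki/Alan_Turing)
[Autotune](http://en.wikipedia.org/wiki/Autotune)
http://en.wikipedia.org/wiki/The_Voder
'''

# Anything that isn't a closing square bracket
name_regex = r"[^]]+"
# Assumption: a URL is http(s):// followed by anything except
# whitespace or a closing paren.
strict_url_regex = r"https?://[^\s)]+"

strict_markup_regex = r'\[({0})]\(\s*({1})\s*\)'.format(name_regex, strict_url_regex)

matches = re.findall(strict_markup_regex, text)
for match in matches:
    print(match)
```

The trade-off is that a URL which itself contains parentheses would not be matched by this pattern, so it is only suitable if your links never look like that.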

Jon Betts answered Feb 25 '23 05:02