Python: store many regex matches in tuple?

Question

I'm trying to make a simple Python-based HTML parser using regular expressions. My problem is trying to get my regex search query to find all the possible matches, then store them in a tuple.

Let's say I have a page with the following stored in the variable HTMLtext:

<ul>
<li class="active"><b><a href="/blog/home">Back to the index</a></b></li>
<li><b><a href="/blog/about">About Me!</a></b></li>
<li><b><a href="/blog/music">Audio Production</a></b></li>
<li><b><a href="/blog/photos">Gallery</a></b></li>
<li><b><a href="/blog/stuff">Misc</a></b></li>
<li><b><a href="/blog/contact">Shoot me an email</a></b></li>
</ul>

I want to perform a regex search on this text and return a tuple containing the last URL directory of each link. So, I'd like to return something like this:

pages = ["home", "about", "music", "photos", "stuff", "contact"]

So far, I'm able to use regex to search for one result:

pages = [re.compile('<a href="/blog/(.*)">').search(HTMLtext).group(1)]

Running this expression makespages = ['home'].

How can I get the regex search to continue for the whole text, appending the matched text to this tuple?

(Note: I know I probably should NOT be using regex to parse HTML. But I want to know how to do this anyway.)

tchrist · Accepted Answer

Your pattern won’t work on all inputs, including yours. The .* is going to be too greedy (technically, it finds a maximal match), causing it to be the first href and the last corresponding close. The two simplest ways to fix this is to use either a minimal match, or else a negates character class.

# minimal match approach
pages = re.findall(r'<a\s+href="/blog/(.+?)">', 
                   full_html_text, re.I + re.S)

# negated charclass approach
pages = re.findall(r'<a\s+href="/blog/([^"]+)">',
                   full_html_text, re.I)

Obligatory Warning

For simple and reasonably well-constrained text, regexes are just fine; after all, that’s why we use regex search-and-replace in our text editors when editing HTML! However, it gets more and more complicated the less you know about the input, such as

if there’s some other field intervening between the <a and the href, like <a title="foo" href="bar">
casing issues like <A HREF='foo'>
whitespace issues
alternate quotes like href='/foo/bar' instead of href="/foo/bar"
embedded HTML comments

That’s not an exclusive list of concerns; there are others. And so, using regexes on HTML thus is possible but whether it’s expedient depends on too many other factors to judge.

However, from the little example you’ve shown, it looks perfectly ok for your own case. You just have to spiff up your pattern and call the right method.

ovgolovin · Answer

Use findall function of re module:

pages = re.findall('<a href="/blog/([^"]*)">',HTMLtext)
print(pages)

Output:

['home', 'about', 'music', 'photos', 'stuff', 'contact']

Python: store many regex matches in tuple?

Tags:

python

html

regex

parsing

hao_maike

2 Answers

Obligatory Warning

tchrist

ovgolovin

Recent Activity

Donate For Us

Python: store many regex matches in tuple?

Tags:

python

html

regex

parsing

hao_maike

2 Answers

Obligatory Warning

tchrist

ovgolovin

Related questions

Recent Activity

Donate For Us