Python: store many regex matches in tuple?

I'm trying to make a simple Python-based HTML parser using regular expressions. My problem is trying to get my regex search query to find all the possible matches, then store them in a tuple.

Let's say I have a page with the following stored in the variable HTMLtext:

<ul>
<li class="active"><b><a href="/blog/home">Back to the index</a></b></li>
<li><b><a href="/blog/about">About Me!</a></b></li>
<li><b><a href="/blog/music">Audio Production</a></b></li>
<li><b><a href="/blog/photos">Gallery</a></b></li>
<li><b><a href="/blog/stuff">Misc</a></b></li>
<li><b><a href="/blog/contact">Shoot me an email</a></b></li>
</ul>

I want to perform a regex search on this text and return a tuple containing the last URL directory of each link. So, I'd like to return something like this:

pages = ["home", "about", "music", "photos", "stuff", "contact"]

So far, I'm able to use regex to search for one result:

pages = [re.compile('<a href="/blog/(.*)">').search(HTMLtext).group(1)]

Running this expression makes pages = ['home'].

How can I get the regex search to continue for the whole text, appending the matched text to this tuple?

(Note: I know I probably should NOT be using regex to parse HTML. But I want to know how to do this anyway.)

hao_maike asked Mar 24 '12 20:03


2 Answers

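Your pattern won’t work on all inputs. The .* is greedy (technically, it finds a maximal match), so whenever more than one "> is within its reach, such as two links on the same line, the capture runs from the first href to the last corresponding close. On top of that, search only ever returns the first match, whereas findall returns them all. The two simplest ways to fix the greediness are to use either a minimal match or a negated character class.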

import re

# minimal (non-greedy) match approach; re.S lets . also match newlines
pages = re.findall(r'<a\s+href="/blog/(.+?)">',
                   HTMLtext, re.I | re.S)

# negated character class approach: the capture can never cross a "
pages = re.findall(r'<a\s+href="/blog/([^"]+)">',
                   HTMLtext, re.I)
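
To make the difference between the greedy pattern and the two fixes concrete, here is a small sketch. The one-line `html` string is a made-up input (not the asker's page), chosen so that two links sit on the same line:

```python
import re

# Hypothetical input: two links on one line, so a greedy .* can span both.
html = '<a href="/blog/home">Home</a> <a href="/blog/about">About</a>'

greedy = re.findall(r'<a href="/blog/(.*)">', html)      # maximal match
lazy = re.findall(r'<a href="/blog/(.+?)">', html)       # minimal match
negated = re.findall(r'<a href="/blog/([^"]+)">', html)  # negated class

print(greedy)   # ['home">Home</a> <a href="/blog/about']
print(lazy)     # ['home', 'about']
print(negated)  # ['home', 'about']
```

The greedy version swallows everything up to the last "> on the line, while both fixes stop at the first closing quote.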

Obligatory Warning

For simple and reasonably well-constrained text, regexes are just fine; after all, that’s why we use regex search-and-replace in our text editors when editing HTML! However, it gets more and more complicated the less you know about the input, such as

  • if there’s some other field intervening between the <a and the href, like <a title="foo" href="bar">
  • casing issues like <A HREF='foo'>
  • whitespace issues
  • alternate quotes like href='/foo/bar' instead of href="/foo/bar"
  • embedded HTML comments

That’s not an exhaustive list of concerns; there are others. Using regexes on HTML is therefore possible, but whether it’s expedient depends on too many other factors to judge.
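
As a rough illustration of how quickly the pattern grows once some of those concerns are addressed, the sketch below tolerates intervening attributes, upper-case tags, extra whitespace, and either quote style. The name `LINK_RE` and the `sample` text are invented for this example, and the pattern still ignores HTML comments and other edge cases, so treat it as a sketch rather than a robust parser:

```python
import re

# Hypothetical, more tolerant pattern (verbose mode, case-insensitive).
LINK_RE = re.compile(
    r'''<a\b              # the tag name, not <abbr> or similar
        [^>]*?            # any attributes that come before href
        \bhref\s*=\s*     # href with optional spaces around the equals sign
        (["'])            # opening quote, single or double
        /blog/
        (?P<page>[^"'>]+) # the last path component
        \1                # the same quote character closes the value
    ''',
    re.I | re.X,
)

sample = """
<li><a title="Home page" href="/blog/home">Back</a></li>
<li><A HREF='/blog/about'>About</A></li>
<li><a  href = "/blog/music" class="x">Audio</a></li>
"""

# finditer is used because the pattern has two groups and only one is wanted.
pages = [m.group('page') for m in LINK_RE.finditer(sample)]
print(pages)  # ['home', 'about', 'music']
```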

However, from the little example you’ve shown, it looks perfectly ok for your own case. You just have to spiff up your pattern and call the right method.

tchrist answered Oct 19 '22 14:10


Use the findall function of the re module:

import re

# findall returns every non-overlapping match of the capture group as a list
pages = re.findall(r'<a href="/blog/([^"]*)">', HTMLtext)
print(pages)

Output:

['home', 'about', 'music', 'photos', 'stuff', 'contact']
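
Since the question asks for a tuple specifically, note that findall returns a list; wrapping the result in tuple() converts it. A minimal sketch, using a shortened copy of the question's HTMLtext so that it runs on its own:

```python
import re

# Shortened copy of the question's markup, only to keep the example runnable.
HTMLtext = """<ul>
<li class="active"><b><a href="/blog/home">Back to the index</a></b></li>
<li><b><a href="/blog/about">About Me!</a></b></li>
</ul>"""

pattern = r'<a href="/blog/([^"]*)">'

# findall builds a list; tuple() turns it into an immutable tuple.
pages = tuple(re.findall(pattern, HTMLtext))
print(pages)  # ('home', 'about')

# finditer yields match objects one at a time, if you need more than the text.
pages_from_iter = tuple(m.group(1) for m in re.finditer(pattern, HTMLtext))
print(pages_from_iter)  # ('home', 'about')
```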

ovgolovin answered Oct 19 '22 15:10