Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python -- Regex -- How to find a string between two sets of strings

Consider the following:

<div id=hotlinklist>
  <a href="foo1.com">Foo1</a>
  <div id=hotlink>
    <a href="/">Home</a>
  </div>
  <div id=hotlink>
    <a href="/extract">Extract</a>
  </div>
  <div id=hotlink>
    <a href="/sitemap">Sitemap</a>
  </div>
</div>

How would you go about taking out the sitemap line with regex in python?

<a href="/sitemap">Sitemap</a>

The following can be used to pull out the anchor tags.

'/<a(.*?)a>/i'

However, there are multiple anchor tags. Also there are multiple hotlink(s) so we can't really use them either?

like image 626
un33k Avatar asked May 11 '09 20:05

un33k


1 Answers

Don't use a regex. Use BeautfulSoup, an HTML parser.

from BeautifulSoup import BeautifulSoup

html = \
"""
<div id=hotlinklist>
  <a href="foo1.com">Foo1</a>
  <div id=hotlink>
    <a href="/">Home</a>
  </div>
  <div id=hotlink>
    <a href="/extract">Extract</a>
  </div>
  <div id=hotlink>
    <a href="/sitemap">Sitemap</a>
  </div>
</div>"""

soup = BeautifulSoup(html)
soup.findAll("div",id="hotlink")[2].a

# <a href="/sitemap">Sitemap</a>
like image 102
Unknown Avatar answered Oct 18 '22 12:10

Unknown