Consider the following:
<div id=hotlinklist>
<a href="foo1.com">Foo1</a>
<div id=hotlink>
<a href="/">Home</a>
</div>
<div id=hotlink>
<a href="/extract">Extract</a>
</div>
<div id=hotlink>
<a href="/sitemap">Sitemap</a>
</div>
</div>
How would you go about taking out the sitemap line with regex in python?
<a href="/sitemap">Sitemap</a>
The following can be used to pull out the anchor tags.
'/<a(.*?)a>/i'
However, there are multiple anchor tags. Also there are multiple hotlink(s) so we can't really use them either?
Don't use a regex. Use BeautfulSoup, an HTML parser.
from BeautifulSoup import BeautifulSoup
html = \
"""
<div id=hotlinklist>
<a href="foo1.com">Foo1</a>
<div id=hotlink>
<a href="/">Home</a>
</div>
<div id=hotlink>
<a href="/extract">Extract</a>
</div>
<div id=hotlink>
<a href="/sitemap">Sitemap</a>
</div>
</div>"""
soup = BeautifulSoup(html)
soup.findAll("div",id="hotlink")[2].a
# <a href="/sitemap">Sitemap</a>
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With