I have a short piece of HTML that I would like to run through using BeautifulSoup. I've got basic navigation down, but this one has me stumped.
Here's an example piece of HTML (totally made it up):
<div class="textbox">
Buying this item will cost you
<img align="adsbottom" alt="1" src="/1.jpg;type=symbol"/>
silver credits and
<img align="adsbottom" alt="1" src="/1.jpg;type=symbol"/>
golden credits
</div>
Using the 'alt' attributes of the img tags I would like to see the following result: Buying this item will cost you 1 silver credits and 1 golden credits
I have no idea how to loop through the div tag sequentially. I can do the following to extract all the text contained in the div tag:
html = BeautifulSoup(string)
print(html.get_text())
but that gives me a result like this: Buying this item will cost you silver credits and golden credits
Likewise, I can get the value of the alt attribute from the first img tag by doing this:
html = BeautifulSoup(string).img
print(html['alt'])
But of course this only gives me the attribute value.
How can I iterate through all these elements in the correct order? Is it possible to read the text in the div element and the attributes of the img elements in consecutive order?
You can loop through all children of a tag, including text nodes, and test their type to see whether they are Tag or NavigableString objects:
from bs4 import Tag
result = []
for child in html.find('div', class_='textbox').children:
    if isinstance(child, Tag):
        # An element such as <img>: use its alt attribute, or '' if it has none
        result.append(child.get('alt', ''))
    else:
        # A text node: trim the surrounding whitespace and newlines
        result.append(child.strip())
print(' '.join(result))
Demo:
>>> from bs4 import BeautifulSoup, Tag
>>> sample = '''\
... <div class="textbox">
... Buying this item will cost you
... <img align="adsbottom" alt="1" src="/1.jpg;type=symbol"/>
... silver credits and
... <img align="adsbottom" alt="1" src="/1.jpg;type=symbol"/>
... golden credits
... </div>
... '''
>>> html = BeautifulSoup(sample, 'html.parser')
>>> result = []
>>> for child in html.find('div', class_='textbox').children:
...     if isinstance(child, Tag):
...         result.append(child.get('alt', ''))
...     else:
...         result.append(child.strip())
...
>>> print(' '.join(result))
Buying this item will cost you 1 silver credits and 1 golden credits
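If you need this in more than one place, the loop above can be wrapped in a small helper. The following is only a sketch under a few assumptions of mine: the function name text_with_alt and its css_class parameter are made up for illustration, it walks .descendants rather than .children so that text inside nested tags is picked up as well, and it skips HTML comments and empty fragments.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag


def text_with_alt(markup, css_class="textbox"):
    """Return the div's text with every <img> replaced by its alt value.

    Hypothetical helper: the name and the css_class parameter are
    illustrative and not part of BeautifulSoup.
    """
    soup = BeautifulSoup(markup, "html.parser")
    div = soup.find("div", class_=css_class)
    if div is None:
        return ""
    pieces = []
    # .descendants yields tags and text nodes in document order
    for node in div.descendants:
        if isinstance(node, Tag):
            if node.name == "img":
                pieces.append(node.get("alt", "").strip())
        elif isinstance(node, NavigableString) and not isinstance(node, Comment):
            pieces.append(node.strip())
    # Drop empty fragments so the join does not produce double spaces
    return " ".join(piece for piece in pieces if piece)


sample = """
<div class="textbox">
Buying this item will cost you
<img align="adsbottom" alt="1" src="/1.jpg;type=symbol"/>
silver credits and
<img align="adsbottom" alt="1" src="/1.jpg;type=symbol"/>
golden credits
</div>
"""

print(text_with_alt(sample))
# Buying this item will cost you 1 silver credits and 1 golden credits
```

Returning an empty string when no matching div exists is a design choice for this sketch; raising an exception may suit your code better.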
This can also be done with a single XPath query:
//div[@class="textbox"]/text() | //div[@class="textbox"]/img/@alt
Unfortunately, BeautifulSoup doesn't support XPath, but lxml does:
import lxml.html
root = lxml.html.fromstring("""
<div class="textbox">
Buying this item will cost you
<img align="adsbottom" alt="1" src="/1.jpg;type=symbol"/>
silver credits and
<img align="adsbottom" alt="1" src="/1.jpg;type=symbol"/>
golden credits
</div>
""")
pieces = root.xpath('//div[@class="textbox"]/text() | //div[@class="textbox"]/img/@alt')
# Both text nodes and attribute values come back as strings, in document order
print(' '.join(piece.strip() for piece in pieces))