I have a short piece of HTML that I would like to run through using BeautifulSoup. I've got basic navigation down, but this one has me stumped.
Here's an example piece of HTML (totally made it up):
<div class="textbox">
    Buying this item will cost you 
    <img align="adsbottom" alt="1" src="/1.jpg;type=symbol"/>
    silver credits and
    <img align="adsbottom" alt="1" src="/1.jpg;type=symbol"/>
    golden credits
</div>
Using the 'alt' attributes of the img tags I would like to see the following result: Buying this item will cost you 1 silver credits and 1 golden credits
I have no idea how to loop through the contents of the div tag sequentially. I can extract all the text contained in the div tag like this:
html = BeautifulSoup(string)
print(html.get_text())
but that drops the images and gives me a result like this: Buying this item will cost you silver credits and golden credits
Likewise, I can get the values of the alt-attributes from the img-tags by doing this:
html = BeautifulSoup(string).img
print(html['alt'])
But of course this only gives me the attribute value.
How can I iterate through all these elements in the correct order? Is it possible to read the text in the div element and the attributes of the img elements in consecutive order?
You can loop through all children of a tag, including text; test for their type to see if they are Tag or NavigableString objects:
from bs4 import Tag
result = []
for child in html.find('div', class_='textbox').children:
    if isinstance(child, Tag):
        result.append(child.get('alt', ''))
    else:
        result.append(child.strip())
print(' '.join(result))
Demo:
>>> from bs4 import BeautifulSoup, Tag
>>> sample = '''\
... <div class="textbox">
...     Buying this item will cost you 
...     <img align="adsbottom" alt="1" src="/1.jpg;type=symbol"/>
...     silver credits and
...     <img align="adsbottom" alt="1" src="/1.jpg;type=symbol"/>
...     golden credits
... </div>
... '''
>>> html = BeautifulSoup(sample, 'html.parser')
>>> result = []
>>> for child in html.find('div', class_='textbox').children:
...     if isinstance(child, Tag):
...         result.append(child.get('alt', ''))
...     else:
...         result.append(child.strip())
... 
>>> print(' '.join(result))
Buying this item will cost you 1 silver credits and 1 golden credits
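For reuse, the same loop can be wrapped in a small function. This is only a sketch under a few assumptions: the name text_with_alt is made up for illustration, the parser is given explicitly as 'html.parser', and only direct children of the div are considered (nested tags would contribute their alt attribute, not their text). Empty pieces are skipped so that whitespace-only text nodes do not produce double spaces.

```python
from bs4 import BeautifulSoup, Tag

SAMPLE = """
<div class="textbox">
    Buying this item will cost you
    <img align="adsbottom" alt="1" src="/1.jpg;type=symbol"/>
    silver credits and
    <img align="adsbottom" alt="1" src="/1.jpg;type=symbol"/>
    golden credits
</div>
"""


def text_with_alt(container):
    """Join a tag's direct text children and its child tags' alt text, in document order.

    Hypothetical helper, not part of BeautifulSoup.
    """
    pieces = []
    for child in container.children:
        if isinstance(child, Tag):
            # Child element (here an <img>): use its alt attribute, if present.
            piece = child.get('alt', '')
        else:
            # NavigableString: the plain text between the child tags.
            piece = child.strip()
        if piece:  # skip whitespace-only text and tags without alt
            pieces.append(piece)
    return ' '.join(pieces)


soup = BeautifulSoup(SAMPLE, 'html.parser')
sentence = text_with_alt(soup.find('div', class_='textbox'))
print(sentence)
```

Calling text_with_alt on any other tag works the same way; it does not recurse, so text inside nested elements would need .descendants or a recursive call instead.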
This can also be done with a single XPath query:
//div[@class="textbox"]/text() | //div[@class="textbox"]/img/@alt
Unfortunately, BeautifulSoup doesn't support XPath, but lxml does:
import lxml.html
root = lxml.html.fromstring("""
    <div class="textbox">
        Buying this item will cost you 
        <img align="adsbottom" alt="1" src="/1.jpg;type=symbol"/>
        silver credits and
        <img align="adsbottom" alt="1" src="/1.jpg;type=symbol"/>
        golden credits
    </div>
""")
pieces = root.xpath('//div[@class="textbox"]/text() | //div[@class="textbox"]/img/@alt')
print(' '.join(piece.strip() for piece in pieces))
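The XPath union works because the results come back in document order. As a rough illustration of what that order looks like in lxml's own data model, the same sentence can be assembled without the union by walking the element directly: lxml keeps the text before the first child in .text and the text following each child in that child's .tail. The function name text_with_alt below is hypothetical, and the sketch assumes the div has only img children.

```python
import lxml.html

SAMPLE = """
<div class="textbox">
    Buying this item will cost you
    <img align="adsbottom" alt="1" src="/1.jpg;type=symbol"/>
    silver credits and
    <img align="adsbottom" alt="1" src="/1.jpg;type=symbol"/>
    golden credits
</div>
"""


def text_with_alt(element):
    """Interleave an element's text with its children's alt attributes (hypothetical helper)."""
    # Text that precedes the first child element.
    pieces = [element.text or '']
    for child in element:
        # The child's alt attribute, then the text that follows the child.
        pieces.append(child.get('alt', ''))
        pieces.append(child.tail or '')
    stripped = (piece.strip() for piece in pieces)
    return ' '.join(piece for piece in stripped if piece)


root = lxml.html.fromstring(SAMPLE)
div = root.xpath('//div[@class="textbox"]')[0]
sentence = text_with_alt(div)
print(sentence)
```

The XPath version is shorter; this variant is mainly useful when each child needs different handling.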