
Python, BeautifulSoup - <div> text and <img> attributes in correct order

I have a short piece of HTML that I would like to run through using BeautifulSoup. I've got basic navigation down, but this one has me stumped.

Here's an example piece of HTML (totally made it up):

<div class="textbox">
    Buying this item will cost you 
    <img align="adsbottom" alt="1" src="/1.jpg;type=symbol"/>
    silver credits and
    <img align="adsbottom" alt="1" src="/1.jpg;type=symbol"/>
    golden credits
</div>

Using the 'alt' attributes of the img tags I would like to see the following result: Buying this item will cost you 1 silver credits and 1 golden credits

I have no idea how to loop through the contents of the div tag sequentially. I can extract all the text contained in the div like this:

from bs4 import BeautifulSoup

html = BeautifulSoup(string, 'html.parser')
print(html.get_text())

but that skips the img tags and gives me a result like this: Buying this item will cost you silver credits and golden credits

Likewise, I can get the values of the alt-attributes from the img-tags by doing this:

html = BeautifulSoup(string, 'html.parser').img
print(html['alt'])

But of course this only gives me the attribute value of the first img tag.

How can I iterate through all these elements in the correct order? Is it possible to read the text in the div element and the attributes of the img elements in consecutive order?

romatthe asked Dec 15 '13 02:12


2 Answers

You can loop through all children of a tag, including text; test for their type to see if they are Tag or NavigableString objects:

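from bs4 import Tag

result = []
# .children yields the direct children in document order,
# as Tag and NavigableString objects
for child in html.find('div', class_='textbox').children:
    if isinstance(child, Tag):
        # For <img> tags use the alt text; a tag without alt contributes ''
        result.append(child.get('alt', ''))
    else:
        # NavigableString: trim the surrounding whitespace and newlines
        result.append(child.strip())

print(' '.join(result))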

Demo:

>>> from bs4 import BeautifulSoup, Tag
>>> sample = '''\
... <div class="textbox">
...     Buying this item will cost you 
...     <img align="adsbottom" alt="1" src="/1.jpg;type=symbol"/>
...     silver credits and
...     <img align="adsbottom" alt="1" src="/1.jpg;type=symbol"/>
...     golden credits
... </div>
... '''
>>> html = BeautifulSoup(sample, 'html.parser')
>>> result = []
>>> for child in html.find('div', class_='textbox').children:
...     if isinstance(child, Tag):
...         result.append(child.get('alt', ''))
...     else:
...         result.append(child.strip())
... 
>>> print(' '.join(result))
Buying this item will cost you 1 silver credits and 1 golden credits
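
The loop above only looks at direct children of the div. If the img tags might be nested inside other tags, one possible variation (a sketch, not part of the original answer) is to replace every img with its alt text and then read the text of the whole div. The function name text_with_alt, the css_class parameter, and the sample markup with its alt values are made up for illustration:

```python
from bs4 import BeautifulSoup


def text_with_alt(markup, css_class="textbox"):
    """Return the div's text with each <img> replaced by its alt attribute."""
    soup = BeautifulSoup(markup, "html.parser")
    div = soup.find("div", class_=css_class)
    # find_all returns a list, so modifying the tree while looping is safe.
    for img in div.find_all("img"):
        # Swap the tag for a plain string; a missing alt becomes ''.
        img.replace_with(img.get("alt", ""))
    # split() + join collapses newlines and runs of spaces into single spaces.
    return " ".join(div.get_text().split())


# Hypothetical sample: the second image sits inside a nested <b> tag.
sample = """
<div class="textbox">
    Buying this item will cost you
    <img alt="3" src="/3.jpg"/>
    silver credits and
    <b><img alt="5" src="/5.jpg"/> golden</b>
    credits
</div>
"""

print(text_with_alt(sample))
# Buying this item will cost you 3 silver credits and 5 golden credits
```

Note that this modifies the parsed tree, so it is best run on a soup you don't need afterwards.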

Martijn Pieters answered Oct 08 '22 15:10


This can also be done with a single XPath query:

//div[@class="textbox"]/text() | //div[@class="textbox"]/img/@alt

Unfortunately, BeautifulSoup doesn't support XPath, but lxml does:

import lxml.html

root = lxml.html.fromstring("""
    <div class="textbox">
        Buying this item will cost you 
        <img align="adsbottom" alt="1" src="/1.jpg;type=symbol"/>
        silver credits and
        <img align="adsbottom" alt="1" src="/1.jpg;type=symbol"/>
        golden credits
    </div>
""")

pieces = root.xpath('//div[@class="textbox"]/text() | //div[@class="textbox"]/img/@alt')
# text() nodes and @alt values come back together, in document order
print(' '.join(piece.strip() for piece in pieces))
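
If the page contains several textbox divs, the single query above returns the pieces of all of them in one flat list. A possible way to keep them apart (a sketch, assuming one output string per div is wanted) is to select the divs first and run a relative XPath on each. The function name textbox_lines and the sample markup are made up for illustration:

```python
import lxml.html


def textbox_lines(markup):
    """Return one string per <div class="textbox">, with alt text inlined."""
    root = lxml.html.fromstring(markup)
    lines = []
    for div in root.xpath('//div[@class="textbox"]'):
        # Relative query: direct text nodes and the alt attributes of
        # direct <img> children of this div, in document order.
        pieces = div.xpath('./text() | ./img/@alt')
        words = [piece.strip() for piece in pieces]
        # Drop pieces that were whitespace only.
        lines.append(' '.join(word for word in words if word))
    return lines


# Hypothetical sample with two separate textbox divs.
sample = """
<div class="textbox">
    Item A costs <img alt="3" src="/3.jpg"/> silver credits
</div>
<div class="textbox">
    Item B costs <img alt="5" src="/5.jpg"/> golden credits
</div>
"""

for line in textbox_lines(sample):
    print(line)
```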

Blender answered Oct 08 '22 15:10