
Using Beautiful Soup, how do I iterate over all embedded text?

Let's say I wanted to remove vowels from HTML:

<a href="foo">Hello there!</a>Hi!

becomes

<a href="foo">Hll thr!</a>H!

I figure this is a job for Beautiful Soup. How can I select the text in between tags and operate on it like this?

mike asked May 06 '09 18:05




1 Answer

Suppose the variable test_html holds the following HTML content:

<html>
<head><title>Test title</title></head>
<body>
<p>Some paragraph</p>
Useless Text
<a href="http://stackoverflow.com">Some link</a>not a link
<a href="http://python.org">Another link</a>
</body></html>

Just do this:

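from bs4 import BeautifulSoup  # Beautiful Soup 4; the old BeautifulSoup 3 package is Python 2 only

test_html = load_html_from_above()  # placeholder: returns the HTML shown above as a string
soup = BeautifulSoup(test_html, "html.parser")

# find_all(string=True) returns every text node; tags and attributes are not included
for t in soup.find_all(string=True):
    text = str(t)
    for vowel in "aeiou":  # lowercase vowels only
        text = text.replace(vowel, "")
    t.replace_with(text)  # swap the text node in place

print(soup)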

That prints:

<html>
<head><title>Tst ttl</title></head>
<body>
<p>Sm prgrph</p>
Uslss Txt
<a href="http://stackoverflow.com">Sm lnk</a>nt  lnk
<a href="http://python.org">Anthr lnk</a>
</body></html>

Note that the tags and attributes are untouched.
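
The same loop can be wrapped in a small helper and applied to the snippet from the question. This is only a sketch, assuming Beautiful Soup 4 (the bs4 package) with the built-in html.parser; transform_text and strip_vowels are hypothetical names, not part of the library. Unlike the loop above, it also removes uppercase vowels, and it skips comments and script/style contents so that only visible text is changed.

```python
from bs4 import BeautifulSoup, NavigableString


def transform_text(html, transform):
    """Apply `transform` to every visible text node, leaving markup alone."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all returns a list, so replacing nodes while looping is safe.
    for node in soup.find_all(string=True):
        # Comments, doctypes, etc. are subclasses of NavigableString; skip them.
        if type(node) is not NavigableString:
            continue
        # Script and style bodies are code rather than prose; skip them too.
        if node.parent is not None and node.parent.name in ("script", "style"):
            continue
        node.replace_with(transform(str(node)))
    return str(soup)


def strip_vowels(text):
    """Remove lowercase and uppercase ASCII vowels."""
    return text.translate(str.maketrans("", "", "aeiouAEIOU"))


result = transform_text('<a href="foo">Hello there!</a>Hi!', strip_vowels)
print(result)  # <a href="foo">Hll thr!</a>H!
```

Passing the transformation in as a function keeps the tree walking separate from the text operation, so any other str-to-str function can be used in place of strip_vowels.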

nosklo answered Sep 29 '22 06:09