Remove <br> tags from a parsed Beautiful Soup list?

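I'm currently looping over all the rows I want with a for loop:

import urllib2
from BeautifulSoup import BeautifulSoup  # assuming Beautiful Soup 3 on Python 2

page = urllib2.urlopen(pageurl)
soup = BeautifulSoup(page)
tables = soup.find("td", "bodyTd")
for row in tables.findAll('tr'):
    pass  # each row is processed here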

At this point, I have my information, but the <br /> tags are ruining my output.

What's the cleanest way to remove these?

mamontazeri asked May 08 '11 03:05


People also ask

How do I remove a tag from beautiful soup?

BeautifulSoup has a built-in method called extract() that allows you to remove a tag or string from the tree. Once you've located the element you want to get rid of, let's say it's named i_tag, calling i_tag.extract() will remove the element and return it at the same time.

What function in BeautifulSoup will remove a tag from the HTML tree and destroy it?

Tag.decompose() removes a tag from the tree of a given HTML document, then completely destroys it and its contents.
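As a rough illustration of the difference between extract() and decompose() described above, here is a minimal sketch. It assumes the newer bs4 package on Python 3 rather than the Beautiful Soup 3 API used in the question, and the sample markup is made up for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, parsed with the standard-library backend.
soup = BeautifulSoup("<p>keep <i>italic</i> and <b>bold</b></p>", "html.parser")

# extract() detaches the tag from the tree and hands it back,
# so it can still be inspected or re-inserted elsewhere.
i_tag = soup.find("i")
removed = i_tag.extract()
removed_text = removed.get_text()

# decompose() detaches the tag and destroys it; nothing useful is returned.
soup.find("b").decompose()

remaining = str(soup)
```

Use extract() when the removed element is still needed afterwards, and decompose() when it should simply be discarded.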

Can Beautiful Soup handle broken HTML?

Beautiful Soup 3 is not a real HTML parser but uses regular expressions to dive through tag soup (Beautiful Soup 4 instead delegates parsing to a backend such as html.parser, lxml, or html5lib). It is therefore more forgiving in some cases and less good in others. It is not uncommon that lxml/libxml2 parses and fixes broken HTML better, but BeautifulSoup has superior support for encoding detection.


2 Answers

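If you want to translate the <br /> tags to newlines, do something like this:

def text_with_newlines(elem):
    # Walk every descendant of elem, collecting text and turning <br> into '\n'
    text = ''
    for e in elem.recursiveChildGenerator():
        if isinstance(e, basestring):
            # Text node: append it with surrounding whitespace stripped
            text += e.strip()
        elif e.name == 'br':
            text += '\n'
    return text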
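The function above targets Python 2 and Beautiful Soup 3 (basestring, recursiveChildGenerator). A possible port to Python 3 with the bs4 package might look like the sketch below. The sample HTML and the html.parser backend are assumptions made for the example, not part of the original answer.

```python
from bs4 import BeautifulSoup, NavigableString


def text_with_newlines(elem):
    """Return the text under elem, with each <br> turned into a newline."""
    parts = []
    for node in elem.descendants:
        if isinstance(node, NavigableString):
            # Text node: keep it with surrounding whitespace stripped.
            parts.append(node.strip())
        elif node.name == "br":
            parts.append("\n")
    return "".join(parts)


# Hypothetical sample row, standing in for the rows in the question.
html = "<table><tr><td> line one <br /> line two </td></tr></table>"
soup = BeautifulSoup(html, "html.parser")
result = text_with_newlines(soup.find("td"))
```

Collecting the pieces in a list and joining them once avoids repeated string concatenation, but the behavior is otherwise meant to match the original function.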
Mu Mind answered Oct 19 '22 05:10


for e in soup.findAll('br'):
    e.extract()
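For readers on Python 3, the same loop might be written with bs4 as in the sketch below (find_all is the bs4 spelling of findAll). The sample HTML is invented for the example. Note that once the <br> tags are gone, the text on either side runs together unless a separator is passed to get_text().

```python
from bs4 import BeautifulSoup

# Hypothetical sample row, standing in for the rows in the question.
html = "<table><tr><td>first<br />second</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")

# Remove every <br> tag from the parsed tree.
for br in soup.find_all("br"):
    br.extract()

row = soup.find("tr")
joined = row.get_text()      # adjacent text nodes are concatenated directly
spaced = row.get_text(" ")   # a separator keeps the pieces apart
```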
Kabie answered Oct 19 '22 06:10