I'm currently getting into a for loop with all the rows I want:
page = urllib2.urlopen(pageurl)
soup = BeautifulSoup(page)
tables = soup.find("td", "bodyTd")
for row in tables.findAll('tr'):
At this point, I have my information, but the <br /> tags are ruining my output.
What's the cleanest way to remove these?
BeautifulSoup has a built-in method called extract() that removes a tag or string from the tree. Once you've located the element you want to get rid of (say it's named i_tag), calling i_tag.extract() removes the element from the tree and returns it.
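As a rough illustration of the extract() behaviour described above, here is a minimal sketch. It assumes BeautifulSoup 4 (the bs4 package) with the standard-library "html.parser" backend, which is not what the question's code uses; the HTML string is made up for the example.

```python
from bs4 import BeautifulSoup

# Made-up sample markup; the question's real page is not shown.
html = "<td>line one<br/>line two<br/>line three</td>"
soup = BeautifulSoup(html, "html.parser")

# extract() detaches the tag from the tree and hands it back.
first_br = soup.find("br")
removed = first_br.extract()

print(removed.name)               # the detached tag is still a usable object
print(len(soup.find_all("br")))   # how many <br> tags are left in the tree
```

Because the tag is returned rather than destroyed, it can be inspected or re-inserted elsewhere if needed.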
Tag.decompose() removes a tag from the tree of a given HTML document, then completely destroys it and its contents.
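For comparison, a small sketch of decompose(), again assuming BeautifulSoup 4 with "html.parser" and an invented table cell. Unlike extract(), nothing useful comes back; the tag is simply gone.

```python
from bs4 import BeautifulSoup

# Invented sample row for illustration only.
html = "<tr><td>Name<br/>Address<br/>City</td></tr>"
soup = BeautifulSoup(html, "html.parser")

# find_all() returns a list built up front, so removing tags while looping is safe.
for br in soup.find_all("br"):
    br.decompose()

remaining = soup.find_all("br")
cell_text = soup.td.get_text()
print(remaining)
print(repr(cell_text))
```

decompose() is probably the better fit when the removed tags are never needed again.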
BeautifulSoup 3 is not a real HTML parser; it uses regular expressions to dive through tag soup. It is therefore more forgiving in some cases and less good in others. It is not uncommon for lxml/libxml2 to parse and fix broken HTML better, but BeautifulSoup has superior support for encoding detection.
If you want to translate the <br />'s to newlines, do something like this:
def text_with_newlines(elem):
    # Walk every descendant of elem, keeping text and turning <br /> into '\n'.
    text = ''
    for e in elem.recursiveChildGenerator():
        if isinstance(e, basestring):
            text += e.strip()
        elif e.name == 'br':
            text += '\n'
    return text
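The function above is written for Python 2 (basestring) and the older BeautifulSoup API. A possible port to Python 3 is sketched below, assuming BeautifulSoup 4 (bs4) with "html.parser"; the .descendants iterator and the NavigableString/Tag checks are swapped in for recursiveChildGenerator() and basestring, and the sample address is invented.

```python
from bs4 import BeautifulSoup, NavigableString, Tag


def text_with_newlines(elem):
    """Collect the text under elem, emitting a newline for each <br>."""
    parts = []
    for node in elem.descendants:
        if isinstance(node, NavigableString):
            # Trim the whitespace around each text fragment.
            parts.append(node.strip())
        elif isinstance(node, Tag) and node.name == "br":
            parts.append("\n")
    return "".join(parts)


# Invented sample cell for illustration only.
soup = BeautifulSoup("<td> 12 Main St <br/> Springfield <br/> USA </td>", "html.parser")
result = text_with_newlines(soup.td)
print(result)
```

Collecting the fragments in a list and joining once avoids repeated string concatenation, though for a single table cell the difference is negligible.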
To remove the <br /> tags from the tree entirely:
for e in soup.findAll('br'):
    e.extract()
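One thing to watch with outright removal is that the text on either side of each <br /> runs together. The sketch below contrasts removal with swapping in a space, assuming BeautifulSoup 4 (bs4) with "html.parser"; replace_with() is a BeautifulSoup 4 method, not something the answers above use, and the sample cell is invented.

```python
from bs4 import BeautifulSoup

# Invented sample cell for illustration only.
html = "<td>Alice<br/>Smith</td>"

# Option 1: extract() each <br>; the surrounding text runs together.
soup = BeautifulSoup(html, "html.parser")
for br in soup.find_all("br"):
    br.extract()
joined = soup.td.get_text()

# Option 2: swap each <br> for a space so the pieces stay separated.
soup = BeautifulSoup(html, "html.parser")
for br in soup.find_all("br"):
    br.replace_with(" ")
spaced = soup.td.get_text()

print(repr(joined))
print(repr(spaced))
```

Which option is cleaner depends on whether the line breaks carried meaning in the original page.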