
Getting html stripped of script and style tags with BeautifulSoup?

I have a simple script where I am fetching an HTML page, passing it to BeautifulSoup to remove all script and style tags, then I want to pass the HTML result to another method. Is there an easy way to do this? Skimming the BeautifulSoup.py, I haven't seen it yet.

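from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 (Python 2)

soup = BeautifulSoup(html)
# Remove every <script> tag from the tree
for script in soup("script"):
    script.extract()

# Remove every <style> tag from the tree
for style in soup("style"):
    style.extract()
contents = soup.html.contents
text = loader.extract_text(contents)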

contents = soup.html.contents just gives me a list of node objects rather than markup. Is there a method that returns the raw HTML after soup has manipulated it? Or do I need to walk the contents list and piece the HTML back together, excluding the script and style tags?

Or is there an even better solution to accomplish what I want?

Nathan asked Oct 06 '10 15:10


People also ask

Can BeautifulSoup handle broken HTML?

BeautifulSoup is a Python package that can parse broken HTML. lxml can as well, through its parser based on libxml2.

How do you remove all HTML tags in Python?

The re.sub() method can remove the HTML tags in a string by replacing every match of a tag pattern with an empty string.
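
A minimal sketch of the re.sub() approach, using only the standard library. The pattern and the helper name strip_tags are illustrative assumptions, not taken from the original. Note that a regex like this only deletes the tags themselves: the text inside script and style elements is left behind, which is one reason the BeautifulSoup approach suits the question above better.

```python
import re

# Assumed pattern: anything between "<" and the next ">" is treated as a tag.
# This is a crude heuristic, not a real HTML parser.
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(html):
    """Return html with every match of TAG_RE replaced by an empty string."""
    return TAG_RE.sub("", html)


print(strip_tags("<p>Hello <b>world</b></p>"))  # prints: Hello world
```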

How do you remove tags from BeautifulSoup?

BeautifulSoup has a built-in method called extract() that removes a tag or string from the tree. Once you've located the element you want to get rid of, say it's named i_tag, calling i_tag.extract() removes the element and returns it at the same time.
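
As a rough illustration of the "remove and return" behaviour, here is a small sketch that assumes Python 3 with the bs4 package (BeautifulSoup 4) and the built-in "html.parser" backend. The sample markup is made up for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration only.
soup = BeautifulSoup("<p>Keep <i>drop me</i> this</p>", "html.parser")

i_tag = soup.find("i")      # locate the element to remove
removed = i_tag.extract()   # detach it from the tree; the same element is returned

print(removed)              # the detached <i> element
print(soup.find("i"))       # None: the tree no longer contains an <i> tag
```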


1 Answer

unicode(soup) gives you the HTML as a string.

Also, what you want is this:

for elem in soup.findAll(['script', 'style']):
    elem.extract()
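
For readers on Python 3, the same idea can be sketched with the bs4 package (BeautifulSoup 4), where find_all is the current spelling of findAll and str(soup) plays the role of unicode(soup). The function name strip_scripts_and_styles, the sample markup, and the choice of "html.parser" are assumptions made for this example.

```python
from bs4 import BeautifulSoup


def strip_scripts_and_styles(html):
    """Return html as a string with all <script> and <style> elements removed."""
    soup = BeautifulSoup(html, "html.parser")
    # One pass over both tag names, as in the loop above.
    for elem in soup.find_all(["script", "style"]):
        elem.extract()
    # Serialize the modified tree back to markup.
    return str(soup)


# Hypothetical input for illustration only.
sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><p>Hi</p><script>alert(1)</script></body></html>"
)
cleaned = strip_scripts_and_styles(sample)
print(cleaned)
```

The returned string can then be handed to whatever method needs the cleaned HTML.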

Jochen Ritzel answered Oct 18 '22 12:10