I have a simple script that fetches an HTML page and passes it to BeautifulSoup to remove all script and style tags. I then want to pass the resulting HTML to another method. Is there an easy way to do this? Skimming BeautifulSoup.py, I haven't found one yet.
soup = BeautifulSoup(html)
for script in soup("script"):
    soup.script.extract()
for style in soup("style"):
    soup.style.extract()
contents = soup.html.contents
text = loader.extract_text(contents)
contents = soup.html.contents just returns a list, and everything in it is an instance of one of BeautifulSoup's classes. Is there a method that returns the raw HTML after soup has manipulated it? Or do I need to go through the contents list and piece the HTML back together, excluding the script and style tags?
Or is there an even better solution to accomplish what I want?
BeautifulSoup is a Python package that can parse broken HTML. lxml can do the same, using the parser from libxml2.
The re.sub() function can remove the HTML tags from a string by replacing every match of a tag pattern with an empty string.
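As a rough illustration of the re.sub() approach, here is a minimal sketch. The pattern `<[^>]+>` and the helper name strip_tags are assumptions made for this example, not something taken from the snippet above. Note that a regex like this only deletes the tags themselves, so the text inside script and style elements stays in the output, which is why a parser is the safer choice for the question asked here.

```python
import re

# Hypothetical pattern: matches anything that looks like a tag, e.g. <p> or </div>.
TAG_PATTERN = re.compile(r"<[^>]+>")

def strip_tags(html):
    """Replace every tag-like match with an empty string."""
    return TAG_PATTERN.sub("", html)

sample = "<p>Hello <b>world</b></p><script>var x = 1;</script>"
result = strip_tags(sample)
print(result)  # the script body "var x = 1;" is still present
```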
BeautifulSoup has a built-in method called extract() that removes a tag or string from the tree. Once you've located the element you want to get rid of, say it's named i_tag, calling i_tag.extract() removes the element and returns it at the same time.
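The behaviour described above can be sketched as follows. This assumes the BeautifulSoup 4 package (imported as bs4) and its "html.parser" backend, which may differ from the older BeautifulSoup.py used in the question. The sample markup is made up for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Keep <i>drop me</i> this</p>", "html.parser")

i_tag = soup.i               # locate the element to remove
removed = i_tag.extract()    # detach it from the tree; the same tag is returned

remaining = str(soup)        # the tree no longer contains the <i> element
removed_html = str(removed)  # the extracted tag is still usable on its own
print(remaining)
print(removed_html)
```

Because extract() hands the element back, it can be inspected or re-inserted elsewhere. If the element is simply unwanted, the return value can be ignored.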
unicode(soup)
gives you the HTML. (On Python 3 with BeautifulSoup 4, the equivalent is str(soup).)
Also what you want is this:
for elem in soup.findAll(['script', 'style']):
    elem.extract()
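Putting the two pieces of this answer together, a minimal end-to-end sketch might look like the following. It assumes BeautifulSoup 4 on Python 3 (so find_all and str(soup) stand in for findAll and unicode(soup)). The function name strip_scripts_and_styles and the sample page are hypothetical.

```python
from bs4 import BeautifulSoup

def strip_scripts_and_styles(html):
    """Return the HTML as a string with all <script> and <style> elements removed."""
    soup = BeautifulSoup(html, "html.parser")
    # One pass over both tag names; each match is detached from the tree.
    for elem in soup.find_all(["script", "style"]):
        elem.extract()
    # Serialize the modified tree back to markup.
    return str(soup)

page = ("<html><head><style>p {color: red}</style></head>"
        "<body><script>alert(1)</script><p>Hi</p></body></html>")
cleaned = strip_scripts_and_styles(page)
print(cleaned)
```

The returned string can then be handed to whatever consumes the HTML next, such as the loader.extract_text call in the question.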