Getting html stripped of script and style tags with BeautifulSoup?

I have a simple script that fetches an HTML page and passes it to BeautifulSoup to remove all script and style tags. I then want to pass the resulting HTML to another method. Is there an easy way to do this? Skimming BeautifulSoup.py, I haven't found one yet.

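from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 (Python 2)

soup = BeautifulSoup(html)
# Detach every <script> tag from the tree
for script in soup("script"):
    script.extract()

# Detach every <style> tag from the tree
for style in soup("style"):
    style.extract()

contents = soup.html.contents
text = loader.extract_text(contents)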

soup.html.contents just returns a list, and every item in it is an instance of one of BeautifulSoup's classes. Is there a method that returns the raw HTML after soup has manipulated it? Or do I need to walk the contents list and piece the HTML back together, excluding the script and style tags?

Or is there an even better solution to accomplish what I want?

asked Oct 06 '10 15:10

Nathan

1 Answer

unicode(soup) gives you the HTML as a string.

Also, you can remove both kinds of tag in a single loop:

for elem in soup.findAll(['script', 'style']):
    elem.extract()
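
For readers on Python 3, here is a minimal sketch of the same idea. It assumes the newer bs4 package rather than the BeautifulSoup 3 module used in the question, so str(soup) stands in for unicode(soup) and find_all for findAll. The function name strip_scripts_and_styles and the sample markup are hypothetical, chosen only for illustration.

```python
from bs4 import BeautifulSoup  # assumption: bs4 (Beautiful Soup 4) is installed


def strip_scripts_and_styles(html):
    """Return the HTML as a string with <script> and <style> tags removed.

    Hypothetical helper, not part of Beautiful Soup itself.
    """
    soup = BeautifulSoup(html, "html.parser")
    # find_all accepts a list of tag names, so one loop covers both kinds.
    for elem in soup.find_all(["script", "style"]):
        # decompose() removes the tag and discards it; extract() would also
        # work and additionally returns the removed tag.
        elem.decompose()
    # str() serializes the modified tree back to markup.
    return str(soup)


# Hypothetical sample page for illustration
sample = (
    "<html><head><style>p{color:red}</style>"
    "<script>alert(1)</script></head>"
    "<body><p>Hello</p></body></html>"
)
cleaned = strip_scripts_and_styles(sample)
print(cleaned)
```

The returned string can then be handed to whatever method needs the cleaned HTML, such as the loader.extract_text call in the question.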

answered Oct 18 '22 12:10

Jochen Ritzel

Tags: python, html-parsing, beautifulsoup, python-2.6
