Can <script> tags and all of their contents be removed from HTML with BeautifulSoup, or do I have to use regular expressions or something else?
Step-by-step approach.
Step 1: Import the BeautifulSoup module for scraping and the requests module for fetching the website.
Step 2: Request the URL by calling the get method.
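The two steps above could be sketched roughly as follows. This is only an illustration, not the original author's code: the helper names remove_scripts and fetch_and_clean, the 10-second timeout, and the example.com URL are all assumptions made for the example.

```python
import requests
from bs4 import BeautifulSoup


def remove_scripts(html: str) -> str:
    """Return the HTML with every <script> tag and its contents removed."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all returns a list, so removing tags while looping over it is safe.
    for script in soup.find_all("script"):
        script.decompose()
    return str(soup)


def fetch_and_clean(url: str) -> str:
    """Step 2: request the URL with get, then strip the <script> tags."""
    response = requests.get(url, timeout=10)  # timeout value is an assumption
    response.raise_for_status()
    return remove_scripts(response.text)


# Works on any HTML string, so it can be tried without a network call.
print(remove_scripts("<p>hi</p><script>alert(1)</script>"))  # <p>hi</p>

# Hypothetical usage against a live page:
# print(fetch_and_clean("https://example.com"))
```

Keeping the parsing in its own function means the cleanup can be checked on a literal string before pointing it at a real site.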
Tag.decompose() removes a tag from the tree of a given HTML document, then completely destroys it and its contents.
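As a small sketch of how this differs from extract(), which also removes a tag from the tree: extract() hands the removed tag back, while decompose() returns nothing. The sample markup and variable names here are made up for illustration.

```python
from bs4 import BeautifulSoup

html = "<div><script>a</script><p>keep</p><script>b</script></div>"
soup = BeautifulSoup(html, "html.parser")
first, second = soup.find_all("script")

removed = first.extract()    # detached from the tree, but still a usable tag
result = second.decompose()  # detached and destroyed; the call returns None

print(removed)  # <script>a</script>
print(result)   # None
print(soup)     # <div><p>keep</p></div>
```

If the removed content is never needed again, decompose() is the more direct choice; extract() is useful when the tag should be inspected or reinserted elsewhere.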
BeautifulSoup is a Python package that can parse broken HTML, much as lxml can with its parser built on libxml2.
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<script>a</script>baba<script>b</script>', 'html.parser')
>>> for s in soup.select('script'):
...     _ = s.extract()  # assign the result so the REPL does not echo each removed tag
...
>>> soup
baba
Updated answer for future reference: the correct answer is decompose(). There are other ways to do this, but decompose() works in place.
Example usage:
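from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>This is a slimy text and <i> I am slimer</i></p>', 'html.parser')
soup.i.decompose()  # remove the <i> tag and its contents in place
print(str(soup))  # prints '<p>This is a slimy text and </p>' (the space before the removed tag stays)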
Pretty useful for getting rid of detritus like <script>, <img>, and so forth.
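For clearing out several kinds of tags at once, something along these lines might work. It is a sketch only: strip_tags and its default tag names are hypothetical, chosen to match the tags mentioned above.

```python
from bs4 import BeautifulSoup


def strip_tags(html, names=("script", "img")):
    """Return the HTML with every tag named in `names` removed."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all accepts a list of tag names and matches any of them.
    for tag in soup.find_all(list(names)):
        tag.decompose()
    return str(soup)


sample = '<p>text<img src="x.png"/></p><script>track()</script>'
print(strip_tags(sample))  # <p>text</p>
```

Passing a different tuple, for example names=("style",), would target other tags without changing the function.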