
Can I remove script tags with BeautifulSoup?

Can <script> tags and all of their contents be removed from HTML with BeautifulSoup, or do I have to use Regular Expressions or something else?

Sam asked Apr 08 '11 17:04




2 Answers

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<script>a</script>baba<script>b</script>', 'html.parser')
>>> for s in soup.select('script'):
...     s.extract()
...
>>> soup
baba
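As a possible follow-up to the session above: extract() returns the element it removes, so the detached tags can be kept if they are needed later. The sketch below assumes the same input string; the variable name `removed` is only illustrative.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<script>a</script>baba<script>b</script>", "html.parser")

# extract() detaches each tag from the tree and returns it,
# so the removed elements can be collected for later inspection.
removed = [s.extract() for s in soup.select("script")]

print(str(soup))                      # baba
print([str(tag) for tag in removed])  # ['<script>a</script>', '<script>b</script>']
```

If the removed tags are not needed afterwards, decompose() is an alternative that discards them instead of returning them.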
Fábio Diniz answered Oct 28 '22 15:10


Updated answer for future reference: the method to use is decompose(). Other approaches work too, but decompose() removes the tag from the tree in place and destroys it along with its contents.

Example usage:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>This is a slimy text and <i> I am slimer</i></p>', 'html.parser')
soup.i.decompose()  # removes the <i> tag and its contents in place
print(soup)  # prints: <p>This is a slimy text and </p>

Pretty useful to get rid of detritus like <script>, <img> and so forth.
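Applied to the original question, the same call can be run over every <script> tag in a document. This is a minimal sketch rather than the only way to do it; the helper name `strip_scripts` and the sample HTML are made up for illustration.

```python
from bs4 import BeautifulSoup


def strip_scripts(html: str) -> str:
    """Return the HTML with every <script> element and its contents removed."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all() returns a list-like result, so removing tags while looping is safe.
    for tag in soup.find_all("script"):
        tag.decompose()
    return str(soup)


cleaned = strip_scripts("<p>Hello</p><script>alert('x')</script><p>World</p>")
print(cleaned)  # <p>Hello</p><p>World</p>
```

Passing a list such as ["script", "style"] to find_all() would extend the same loop to other unwanted tags.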

Abhishek Dujari answered Oct 28 '22 13:10