Remove all style, scripts, and html tags from an html page

Tags: python, html, beautifulsoup

Here is what I have so far:

from bs4 import BeautifulSoup

def cleanme(html):
    # create a new bs4 object from the html data loaded;
    # naming the parser avoids bs4's "no parser specified" warning
    soup = BeautifulSoup(html, "html.parser")
    for script in soup(["script"]):  # only <script> elements are removed here
        script.extract()
    text = soup.get_text()
    return text

testhtml = "<!DOCTYPE HTML>\n<head>\n<title>THIS IS AN EXAMPLE </title><style>.call {font-family:Arial;}</style><script>getit</script><body>I need this text captured<h1>And this</h1></body>"

cleaned = cleanme(testhtml)
print(cleaned)

This works to remove the script, but the style contents are still in the output.

859

asked Jun 01 '15 03:06

htifcs

2 Answers

It looks like you almost have it. You also need to remove the HTML tags and the CSS styling code. Here is my solution (I updated the function):

from bs4 import BeautifulSoup

def cleanMe(html):
    soup = BeautifulSoup(html, "html.parser")  # create a new bs4 object from the html data loaded
    for tag in soup(["script", "style"]):  # remove all javascript and stylesheet code
        tag.extract()
    # get the text; get_text() drops the remaining html tags
    text = soup.get_text()
    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # split lines that hold several phrases (separated by a double space) into one phrase each
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)
    return text
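
To see what the three clean-up steps after get_text() do, here is a small sketch that runs them on a plain string. The sample string is made up for illustration (it only imitates the kind of ragged text get_text() tends to return), so no HTML parsing is involved:

```python
# Hypothetical sample shaped like get_text() output: stray indentation,
# blank lines, and two phrases separated by a double space.
raw = "  THIS IS AN EXAMPLE \n\n   I need this text captured  And this\n\n"

# 1. split into lines and strip each one
lines = [line.strip() for line in raw.splitlines()]

# 2. split each line on double spaces, so every phrase becomes its own chunk
chunks = [phrase.strip() for line in lines for phrase in line.split("  ")]

# 3. drop the empty chunks and put one phrase per line
text = "\n".join(chunk for chunk in chunks if chunk)

print(text)
```

This prints the three phrases on three separate lines. Note that a single space never splits a phrase; only a run of two spaces does.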

120

answered Oct 06 '22 04:10

jamescampbell

You can use decompose() to completely remove the tags from the document, and the stripped_strings generator to retrieve the remaining text.

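from bs4 import BeautifulSoup

def clean_me(html):
    # naming the parser avoids bs4's "no parser specified" warning
    soup = BeautifulSoup(html, "html.parser")
    for s in soup(['script', 'style']):
        s.decompose()  # remove the tag and its contents from the tree
    return ' '.join(soup.stripped_strings)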

>>> clean_me(testhtml) 
'THIS IS AN EXAMPLE I need this text captured And this'
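
If installing bs4 is not an option, roughly the same idea can be sketched with the standard library's html.parser module instead of BeautifulSoup. This is a different technique from the answer above, not a drop-in replacement: the class name TextOnly and the function clean_me_stdlib are made up for this illustration, and it assumes the script and style elements are properly closed.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # keep only non-blank text that is outside script/style
        if not self.skip_depth and data.strip():
            self.parts.append(data.strip())


def clean_me_stdlib(html):
    # hypothetical helper name; mirrors clean_me() without bs4
    parser = TextOnly()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return " ".join(parser.parts)


testhtml = "<!DOCTYPE HTML>\n<head>\n<title>THIS IS AN EXAMPLE </title><style>.call {font-family:Arial;}</style><script>getit</script><body>I need this text captured<h1>And this</h1></body>"
print(clean_me_stdlib(testhtml))
```

On the test string from the question this gives the same single line of text as clean_me(). BeautifulSoup is more forgiving with badly broken markup, so this sketch is mainly useful for simple pages.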

answered Oct 06 '22 05:10

styvane
