Extracting text from HTML file using Python

To extract text from an HTML page using Python, we can use BeautifulSoup. We call urllib.request.urlopen with the URL we want to fetch the HTML from, then hand the result to BeautifulSoup.

The best piece of code I found for extracting text without picking up JavaScript or other unwanted content:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = urlopen(url).read()
soup = BeautifulSoup(html, features="html.parser")

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text)
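
The whitespace cleanup at the end of that snippet can be tried on its own, without fetching a page. Below is a minimal sketch that wraps those three steps in a function and runs them on a made-up sample string; the name tidy_text is hypothetical and not part of BeautifulSoup.

```python
def tidy_text(text):
    """Turn ragged get_text()-style output into one phrase per line."""
    # Strip leading and trailing whitespace from every line.
    lines = (line.strip() for line in text.splitlines())
    # Split on double spaces so phrases that share a line get a line each.
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # Drop the empty strings left by blank lines and by runs of spaces.
    return "\n".join(chunk for chunk in chunks if chunk)


# Made-up sample input, standing in for the result of soup.get_text().
sample = "  Health news  \n\n\n   Top story    Second story \n"
print(tidy_text(sample))
```

Note that the split is on two consecutive spaces, so single spaces inside a phrase are left alone.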

You just need to install BeautifulSoup first:

pip install beautifulsoup4

html2text is a Python program that does a pretty good job at this.

NOTE: NLTK no longer supports the clean_html function.

Original answer below, and an alternative in the comments section.

Use NLTK

I wasted 4-5 hours fixing issues with html2text. Luckily I came across NLTK.
It works magically.

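# Historical: this runs only on Python 2 with NLTK 2.x.
# In NLTK 3, nltk.clean_html raises NotImplementedError.
import nltk
from urllib import urlopen

url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = urlopen(url).read()
raw = nltk.clean_html(html)
print(raw)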

I found myself facing the same problem today. I wrote a very simple HTML parser to strip incoming content of all markup, returning the remaining text with only a minimum of formatting.

from html.parser import HTMLParser  # Python 3; Python 2 used "from HTMLParser import HTMLParser"
from re import sub
from sys import stderr
from traceback import print_exc

class _DeHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.__text = []

    def handle_data(self, data):
        text = data.strip()
        if len(text) > 0:
            text = sub('[ \t\r\n]+', ' ', text)
            self.__text.append(text + ' ')

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.__text.append('\n\n')
        elif tag == 'br':
            self.__text.append('\n')

    def handle_startendtag(self, tag, attrs):
        if tag == 'br':
            self.__text.append('\n\n')

    def text(self):
        return ''.join(self.__text).strip()


def dehtml(text):
    try:
        parser = _DeHTMLParser()
        parser.feed(text)
        parser.close()
        return parser.text()
    except Exception:  # a bare except would also swallow KeyboardInterrupt and SystemExit
        print_exc(file=stderr)
        return text


def main():
    text = r'''
        <html>
            <body>
                <b>Project:</b> DeHTML<br>
                <b>Description</b>:<br>
                This small script is intended to allow conversion from HTML markup to 
                plain text.
            </body>
        </html>
    '''
    print(dehtml(text))


if __name__ == '__main__':
    main()
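
One limitation of the parser above is that handle_data also receives the contents of script and style elements, so JavaScript and CSS would end up in the output. A possible way around that with the standard library alone is to count how deep we are inside such elements and ignore data while the count is non-zero. The following is a minimal Python 3 sketch under that assumption; the names _TextOnlyParser and html_to_text are illustrative and do not come from the answer above.

```python
import re
from html.parser import HTMLParser


class _TextOnlyParser(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}
    LINE_BREAKING = {"p", "br"}

    def __init__(self):
        super().__init__()
        self._parts = []
        self._skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag in self.LINE_BREAKING:
            self._parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._parts.append(data)

    def text(self):
        raw = "".join(self._parts)
        # Collapse runs of whitespace inside each line, then drop blank lines.
        lines = (re.sub(r"\s+", " ", line).strip() for line in raw.splitlines())
        return "\n".join(line for line in lines if line)


def html_to_text(html):
    parser = _TextOnlyParser()
    parser.feed(html)
    parser.close()
    return parser.text()


# Made-up sample document for illustration.
sample = """<html><head><style>p { color: red; }</style>
<script>console.log("not text");</script></head>
<body><p>Hello, <b>world</b>!</p><p>Second paragraph.</p></body></html>"""
print(html_to_text(sample))
```

Keeping the raw data and normalizing whitespace only at the end preserves the spacing around inline tags such as b, which stripping each piece separately would lose.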
