BeautifulSoup Grab Visible Webpage Text

How do I get text from Div BeautifulSoup?

To get text that contains <br> tags, you can call get_text() with the separator parameter to get the text inside the div. Alternatively, you can replace every <br> tag with a unique string of your choice and, once you have the output, replace that string with newlines.

Try this:

from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.request


def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True


def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    texts = soup.find_all(string=True)
    visible_texts = filter(tag_visible, texts)
    return " ".join(t.strip() for t in visible_texts)

html = urllib.request.urlopen('http://www.nytimes.com/2009/12/21/us/21storm.html').read()
print(text_from_html(html))

The accepted answer from @jbochi does not work for me. The str() function call raises an exception because it cannot encode the non-ASCII characters in the BeautifulSoup element. Here is a more succinct way to filter the example web page to visible text.

from bs4 import BeautifulSoup

html = open('21storm.html').read()
soup = BeautifulSoup(html, 'html.parser')
# remove the elements whose text is never displayed
for s in soup(['style', 'script', '[document]', 'head', 'title']):
    s.extract()
visible_text = soup.get_text()

import urllib.request
from bs4 import BeautifulSoup

url = "https://www.yahoo.com"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text)

I completely respect using Beautiful Soup to get rendered content, but it may not be the ideal package for acquiring the rendered content on a page.

I had a similar problem: I wanted the rendered content, that is, the content visible in a typical browser. In particular, I had many perhaps atypical cases to handle, such as the simple example below. Here the non-displayable tag is nested in a style tag and is not visible in the browsers I checked. Other variations exist, such as defining a class that sets display to none and then using this class for the div.

<html>
  <title>  Title here</title>

  <body>

    lots of text here <p> <br>
    <h1> even headings </h1>

    <style type="text/css"> 
        <div > this will not be visible </div> 
    </style>


  </body>

</html>

One solution posted above is:

html = Utilities.ReadFile('simple.html')
soup = BeautifulSoup.BeautifulSoup(html)
texts = soup.findAll(text=True)
visible_texts = filter(visible, texts)
print(visible_texts)


[u'\n', u'\n', u'\n\n        lots of text here ', u' ', u'\n', u' even headings ', u'\n', u' this will not be visible ', u'\n', u'\n']

This solution certainly has applications in many cases and generally does the job quite well, but for the HTML posted above it retains the text that is not rendered. Searching SO turned up a couple of solutions: "BeautifulSoup get_text does not strip all tags and JavaScript" and "Rendered HTML to plain text using Python".
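As an aside, with bs4 and `html.parser` the unrendered text in the sample can also be dropped by removing the enclosing elements before extracting text. The sketch below is one way to do that under stated assumptions: the function name `rendered_text`, the regular expression, and the extra `display: none` div in the sample are hypothetical additions for illustration. It only recognises inline `style` attributes. Beautiful Soup does not evaluate CSS, so a class that sets display to none in a stylesheet is not detected.

```python
import re
from bs4 import BeautifulSoup

# Matches an inline style attribute that sets display to none.
# Assumption: only inline styles are checked, not stylesheet rules.
HIDDEN_STYLE = re.compile(r"display\s*:\s*none", re.IGNORECASE)


def rendered_text(html):
    """Return an approximation of the text a browser would display."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop elements whose content is never shown as page text.
    for tag in soup(["style", "script", "head", "title"]):
        tag.extract()
    # Drop elements hidden through an inline style attribute.
    for tag in soup.find_all(style=HIDDEN_STYLE):
        tag.extract()
    # Collapse runs of whitespace into single spaces.
    return " ".join(soup.get_text(separator=" ").split())


sample = """<html>
  <title>  Title here</title>
  <body>
    lots of text here <p> <br>
    <h1> even headings </h1>
    <style type="text/css">
        <div > this will not be visible </div>
    </style>
    <div style="display: none">hidden inline</div>
  </body>
</html>"""

print(rendered_text(sample))
```

For pages that hide content through stylesheet classes or JavaScript, a tool that actually renders the page is likely needed instead.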

I tried both these solutions: html2text and nltk.clean_html and was surprised by the timing results so thought they warranted an answer for posterity. Of course, the speeds highly depend on the contents of the data...

One answer here from @Helge was about using nltk of all things.

import nltk

%timeit nltk.clean_html(html)
# 153 us per loop

It worked really well to return a string with rendered html. This nltk module was faster than even html2text, though perhaps html2text is more robust.

import html2text

betterHTML = html.decode(errors='ignore')
%timeit html2text.html2text(betterHTML)
# 3.09 ms per loop
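The `%timeit` lines above only run inside IPython. A rough equivalent in a plain script, using the standard library `timeit` module, might look like the sketch below. The helper names `bs4_visible_text` and `seconds_per_call` and the sample markup are hypothetical. The sketch times a Beautiful Soup extractor because, as far as I know, `nltk.clean_html` is no longer implemented in NLTK 3 and later. Any other extractor that takes an HTML string can be passed in the same way.

```python
import timeit
from bs4 import BeautifulSoup


def bs4_visible_text(html):
    """Extract text with Beautiful Soup after dropping script and style."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.extract()
    return soup.get_text(separator=" ", strip=True)


def seconds_per_call(extractor, html, number=100):
    """Average wall-clock seconds for one call of extractor(html)."""
    total = timeit.timeit(lambda: extractor(html), number=number)
    return total / number


# Hypothetical sample markup; real pages will give very different timings.
sample_html = (
    "<html><body><p>lots of text here</p>"
    "<script>var x = 1;</script></body></html>"
)

per_call = seconds_per_call(bs4_visible_text, sample_html)
print(f"bs4: {per_call * 1e6:.1f} us per call")
```

As noted above, the numbers depend heavily on the input, so it is worth timing against pages representative of the real workload.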

