Rendered HTML to plain text using Python

Tags:

beautifulsoup

I'm trying to convert a chunk of HTML to plain text with BeautifulSoup. Here is an example:

    <div>
        <p>
            Some text
            <span>more text</span>
            even more text
        </p>
        <ul>
            <li>list item</li>
            <li>yet another list item</li>
        </ul>
    </div>
    <p>Some other text</p>
    <ul>
        <li>list item</li>
        <li>yet another list item</li>
    </ul>

I tried doing something like:

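    import re
    import BeautifulSoup  # BeautifulSoup 3

    def parse_text(contents_string):
        # Collapse each line break together with the indentation that follows it
        newlines = re.compile(r'[\r\n]\s+')
        bs = BeautifulSoup.BeautifulSoup(contents_string, convertEntities=BeautifulSoup.BeautifulSoup.HTML_ENTITIES)
        # Join every text node with a newline, regardless of element type
        txt = bs.getText('\n')
        return newlines.sub('\n', txt)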

...but that way the text of my span element always ends up on a new line. This is of course a simple example. Is there a way in Python to get the text of an HTML page the way it would be rendered in a browser (no CSS rules required, just the default way div, span, li, etc. elements are rendered)?

799

asked Nov 12 '12 02:11

btatarov

2 Answers

BeautifulSoup is a scraping library, so it's probably not the best choice for doing HTML rendering. If it's not essential to use BeautifulSoup, you should take a look at html2text. For example:

    import html2text

    html = open("foobar.html").read()
    print(html2text.html2text(html))

This outputs:

    Some text more text even more text

      * list item
      * yet another list item

    Some other text

      * list item
      * yet another list item
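If adding a dependency is not an option, the same idea can be roughly approximated with the standard library alone. The following is only a minimal Python 3 sketch, not how html2text works internally: it assumes a small hand-picked set of block-level tags (`BLOCK_TAGS`), and the names `BlockAwareTextExtractor` and `html_to_text` are made up for illustration. Text inside inline elements such as span stays on the current line, while block elements start a new one.

```python
import re
from html.parser import HTMLParser

# Assumption: a small, hand-picked set of block-level tags.
# Browsers treat many more elements as blocks.
BLOCK_TAGS = {"div", "p", "ul", "ol", "li", "br", "blockquote", "pre",
              "table", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}


class BlockAwareTextExtractor(HTMLParser):
    """Collect text, breaking lines only around block-level elements."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")
        if tag == "li":
            self.parts.append("* ")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        # Source line breaks and indentation collapse to single spaces,
        # so only block tags can introduce a newline.
        self.parts.append(re.sub(r"\s+", " ", data))


def html_to_text(html):
    parser = BlockAwareTextExtractor()
    parser.feed(html)
    parser.close()
    lines = []
    for line in "".join(parser.parts).split("\n"):
        line = re.sub(r" +", " ", line).strip()
        if line:  # drop lines that held only whitespace
            lines.append(line)
    return "\n".join(lines)


print(html_to_text("<div><p>Some text <span>more text</span> even more text</p>"
                   "<ul><li>list item</li></ul></div>"))
```

This sketch does not skip script or style contents, and it ignores tables, links and nested list indentation, so a dedicated library is still the safer choice for real pages.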

190

answered Sep 30 '22 09:09

del

I ran into the same problem trying to extract the rendered text from HTML. BeautifulSoup does not seem to be the ideal package for this; @del's html2text suggestion works well.

On a different SO question, "BeautifulSoup get_text does not strip all tags and JavaScript", @Helge mentioned using nltk. Unfortunately, nltk appears to be discontinuing this method (nltk.clean_html).

I tried both html2text and nltk.clean_html and was surprised by the timing results, so I thought they warranted an answer for posterity. Of course, the speeds depend heavily on the contents of the data...

Answer from @Helge (nltk).

    import nltk
    %timeit nltk.clean_html(html)

was returning 153 us per loop.

It worked really well, returning a string of the rendered text. On this input nltk.clean_html was roughly 20 times faster than html2text (153 us versus 3.09 ms per loop), though html2text is perhaps more robust.

Answer above from @del

    betterHTML = html.decode(errors='ignore')
    %timeit html2text.html2text(betterHTML)

was returning 3.09 ms per loop.
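The `%timeit` lines above only work inside IPython. Outside of it, a comparable measurement can be sketched with the standard `timeit` module. This is an illustration under assumptions, not the measurement used above: `time_per_call` is a hypothetical helper, and `strip_tags` is a crude regex stand-in so the snippet runs without html2text or nltk installed.

```python
import re
import timeit


def strip_tags(html):
    """Hypothetical stand-in converter: crude regex tag removal."""
    return re.sub(r"<[^>]+>", " ", html)


def time_per_call(converter, html, number=1000):
    """Return the average time in seconds of one converter(html) call."""
    total = timeit.timeit(lambda: converter(html), number=number)
    return total / number


html = "<p>Some text <span>more text</span> even more text</p>" * 100
seconds = time_per_call(strip_tags, html)
print("strip_tags: %.1f us per call" % (seconds * 1e6))
```

To compare real converters, pass `html2text.html2text` (or any other callable that takes an HTML string) in place of `strip_tags` and run both on the same input.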

answered Sep 30 '22 09:09

Paul
