
Rendered HTML to plain text using Python

I'm trying to convert a chunk of HTML to plain text with BeautifulSoup. Here is an example:

    <div>
        <p>
            Some text
            <span>more text</span>
            even more text
        </p>
        <ul>
            <li>list item</li>
            <li>yet another list item</li>
        </ul>
    </div>
    <p>Some other text</p>
    <ul>
        <li>list item</li>
        <li>yet another list item</li>
    </ul>

I tried doing something like:

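    import re
    import BeautifulSoup  # BeautifulSoup 3 module, as implied by the calls below

    def parse_text(contents_string):
        # Collapse a line break plus any following whitespace into one newline
        Newlines = re.compile(r'[\r\n]\s+')
        bs = BeautifulSoup.BeautifulSoup(contents_string, convertEntities=BeautifulSoup.BeautifulSoup.HTML_ENTITIES)
        txt = bs.getText('\n')
        return Newlines.sub('\n', txt)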

...but that way my span element always ends up on a new line. This is of course a simple example. Is there a way in Python to get the text of an HTML page the way it would be rendered in a browser (no CSS rules required, just the default way div, span, li, etc. elements are rendered)?

btatarov asked Nov 12 '12 02:11




2 Answers

BeautifulSoup is a scraping library, so it's probably not the best choice for doing HTML rendering. If it's not essential to use BeautifulSoup, you should take a look at html2text. For example:

    import html2text

    # Read the HTML file and print its plain-text rendering
    with open("foobar.html") as f:
        html = f.read()
    print(html2text.html2text(html))

This outputs:

    Some text more text even more text

      * list item
      * yet another list item

    Some other text

      * list item
      * yet another list item
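If installing html2text is not an option, the underlying idea (block elements such as div, p and li start a new line, while inline elements such as span stay on the current line) can be sketched with the standard library alone. This is only a minimal sketch and not how html2text works internally: the `BLOCK_TAGS` set, the `BlockAwareTextExtractor` class and the `html_to_text` function are names made up for this illustration, the tag list is a hand-picked assumption, and script/style contents are not filtered out.

```python
import re
from html.parser import HTMLParser

# Assumption: a hand-picked subset of tags that browsers render as blocks by default.
BLOCK_TAGS = {"div", "p", "ul", "ol", "li", "br", "table", "tr",
              "h1", "h2", "h3", "h4", "h5", "h6", "blockquote", "pre"}


class BlockAwareTextExtractor(HTMLParser):
    """Collects text, marking a line break at every block-level tag boundary."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        # Whitespace inside text (including source newlines) collapses to one space,
        # so only block boundaries contribute "\n" characters.
        self.parts.append(re.sub(r"\s+", " ", data))


def html_to_text(html):
    parser = BlockAwareTextExtractor()
    parser.feed(html)
    parser.close()
    lines = []
    for line in "".join(parser.parts).split("\n"):
        cleaned = re.sub(r" +", " ", line).strip()
        if cleaned:  # drop the empty lines produced by adjacent block boundaries
            lines.append(cleaned)
    return "\n".join(lines)


if __name__ == "__main__":
    sample = """
    <div>
        <p>
            Some text
            <span>more text</span>
            even more text
        </p>
        <ul><li>list item</li><li>yet another list item</li></ul>
    </div>
    """
    print(html_to_text(sample))
```

With this approach the span text stays on the same line as the surrounding paragraph text, which is the behaviour the question asks for. It does not add list bullets or blank lines between paragraphs the way html2text does.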
del answered Sep 30 '22 09:09



I ran into the same problem when trying to get the rendered text out of HTML. BeautifulSoup does not seem to be the ideal package for this, and @del's html2text suggestion works well.

In a different SO question, "BeautifulSoup get_text does not strip all tags and JavaScript", @Helge mentioned using nltk. Unfortunately, nltk appears to be discontinuing this method.

I tried both html2text and nltk.clean_html and was surprised by the timing results, so I thought they warranted an answer for posterity. Of course, the speeds depend heavily on the contents of the data...

Answer from @Helge (nltk).

    import nltk
    %timeit nltk.clean_html(html)
    # was returning 153 us per loop

It worked really well and returned a string of the rendered HTML text. nltk.clean_html was considerably faster than html2text here (153 us versus 3.09 ms per loop), though html2text is perhaps more robust.

Answer above from @del

    betterHTML = html.decode(errors='ignore')
    %timeit html2text.html2text(betterHTML)
    # 3.09 ms per loop
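The `%timeit` lines above are IPython magics. Outside IPython, a comparable measurement can be made with the standard `timeit` module. The following is a rough sketch only: `time_per_call` is a helper name made up for this example, and `strip_tags` is a hypothetical stand-in converter (a crude regex) used so the snippet runs without third-party packages. To reproduce the comparison, you would pass `html2text.html2text` or another converter in its place.

```python
import re
import timeit


def strip_tags(html):
    """Hypothetical stand-in converter: crude regex tag removal, for timing only."""
    return re.sub(r"<[^>]+>", "", html)


def time_per_call(func, arg, number=1000, repeat=5):
    """Return the best observed time per call, in seconds."""
    timer = timeit.Timer(lambda: func(arg))
    # Take the minimum over several repeats to reduce noise from other processes.
    best = min(timer.repeat(repeat=repeat, number=number))
    return best / number


if __name__ == "__main__":
    sample = "<p>Some text <span>more text</span> even more text</p>" * 100
    seconds = time_per_call(strip_tags, sample)
    print("strip_tags: %.1f us per call" % (seconds * 1e6))
```

As the answer notes, the numbers depend heavily on the input, so it is worth timing against HTML that resembles your real data.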
Paul answered Sep 30 '22 09:09
