
Converting HTML to text with Python

I am trying to convert an HTML block to text using Python.

Input:

<div class="body"><p><strong></strong></p> <p><strong></strong>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p> <p>Consectetuer adipiscing elit. <a href="http://example.com/" target="_blank" class="source">Some Link</a> Aenean commodo ligula eget dolor. Aenean massa</p> <p>Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p> <p>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p> <p>Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p></div> 

Desired output:

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa

Consectetuer adipiscing elit. Some Link Aenean commodo ligula eget dolor. Aenean massa

Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa

Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa

I tried the html2text module without much success:

    #!/usr/bin/env python

    import urllib2
    import html2text
    from BeautifulSoup import BeautifulSoup

    soup = BeautifulSoup(urllib2.urlopen('http://example.com/page.html').read())

    txt = soup.find('div', {'class' : 'body'})

    print(html2text.html2text(txt))

The txt object contains the HTML block above. I'd like to convert it to text and print it on the screen.

Aaron Bandelli asked Feb 04 '13 19:02



1 Answer

soup.get_text() outputs what you want:

    from bs4 import BeautifulSoup

    # html is the markup from the question; naming the parser avoids a bs4 warning
    soup = BeautifulSoup(html, 'html.parser')
    print(soup.get_text())

output:

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa Consectetuer adipiscing elit. Some Link Aenean commodo ligula eget dolor. Aenean massa Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa 

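To put a newline between the text nodes, pass a separator (note that this also breaks lines at inline tags such as the link):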

print(soup.get_text('\n')) 
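
To make the separator behaviour concrete, here is a minimal sketch on a shortened, made-up snippet (the `sample` string and the variable names are my own assumptions, not the asker's page). It compares `get_text()` with and without a separator:

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened sample; not the asker's actual page.
sample = '<div><p>First.</p><p>See <a href="http://example.com/">Some Link</a> here.</p></div>'

soup = BeautifulSoup(sample, "html.parser")

# No separator: the text nodes are concatenated as-is.
plain = soup.get_text()

# With a separator: it is inserted between every text node,
# including the ones around the inline <a> tag.
separated = soup.get_text("\n")

print(repr(plain))
print(repr(separated))
```

So the separator is handy for quick dumps, but it does not know which tags are block-level and which are inline.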

To match your example exactly, replace each newline with two newlines (this assumes the paragraphs in the source HTML are separated by newlines):

soup.get_text().replace('\n','\n\n') 
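
If the source HTML has no newlines between the paragraphs, another option is to take the text of each `<p>` separately and join the results with a blank line. This is only a sketch under assumptions: the `html` string below is a shortened stand-in for the markup in the question, and it assumes every paragraph of interest is a `<p>` inside the `div` with class `body`.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the markup in the question (an assumption for illustration).
html = (
    '<div class="body"><p><strong></strong></p> '
    '<p><strong></strong>Lorem ipsum dolor sit amet.</p> '
    '<p>Consectetuer adipiscing elit. '
    '<a href="http://example.com/" target="_blank" class="source">Some Link</a> '
    'Aenean commodo ligula eget dolor.</p></div>'
)

soup = BeautifulSoup(html, "html.parser")
body = soup.find("div", class_="body")

# Take the text of each <p> separately so inline tags stay on the same line.
paragraphs = [p.get_text().strip() for p in body.find_all("p")]

# Drop empty paragraphs, then separate the rest with a blank line.
text = "\n\n".join(p for p in paragraphs if p)
print(text)
```

Because the link text is read as part of its paragraph, "Some Link" stays inline, and the empty first paragraph is skipped.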

root answered Oct 05 '22 15:10