Use html5lib to convert an HTML fragment to plain text

Question

Is there an easy way to use the Python library html5lib to convert something like this:

<p>Hello World. Greetings from <strong>Mars.</strong></p>

to

Hello World. Greetings from Mars.

Niklas B. · Accepted Answer

With lxml as the parser backend:

import html5lib

body = "<p>Hello World. Greetings from <strong>Mars.</strong></p>"
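doc = html5lib.parse(body, treebuilder="lxml")
# The lxml tree builder returns a plain lxml.etree tree, which has no
# text_content() method, so the text is extracted with XPath instead.
print(doc.xpath("string()"))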

To be honest, this is cheating, since lxml does the text extraction; it gives the same result as the following (only the relevant parts are changed):

from lxml import html
doc = html.fromstring(body)
print(doc.text_content())

If you really want the html5lib parsing engine:

from lxml.html import html5parser
doc = html5parser.fromstring(body)
print(doc.xpath("string()"))
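
For completeness, the same result seems reachable with html5lib alone, without lxml. The following is a minimal sketch, assuming html5lib's default "etree" tree builder, which returns a standard-library ElementTree element from parseFragment; fragment_to_text is a hypothetical helper name, not part of html5lib.

```python
import html5lib


def fragment_to_text(fragment_html):
    """Return the concatenated text of an HTML fragment parsed by html5lib."""
    # With the default "etree" tree builder, parseFragment returns an
    # xml.etree.ElementTree element that wraps the parsed nodes.
    fragment = html5lib.parseFragment(fragment_html)
    # itertext() yields every text node in document order.
    return "".join(fragment.itertext())


body = "<p>Hello World. Greetings from <strong>Mars.</strong></p>"
text = fragment_to_text(body)
print(text)
```

Note that this simply joins the text nodes: it inserts no separators between block elements, so adjacent paragraphs would run together unless whitespace is added by hand.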

seddonym · Answer

I use html2text, which converts HTML to plain text in Markdown format.

from html2text import HTML2Text
handler = HTML2Text()

html = """Lorem <i>ipsum</i> dolor sit amet, <b>consectetur adipiscing</b> elit.<br>
          <br><h1>Nullam eget 
gravida elit</h1>Integer iaculis elit at risus feugiat:
          <br><br><ul><li>Egestas non quis 
lorem.</li><li>Nam id lobortis felis.
          </li><li>Sed tincidunt nulla.</li></ul>
          At massa tempus, quis 
vehicula odio laoreet.<br>"""
text = handler.handle(html)

>>> print(text)
Lorem _ipsum_ dolor sit amet, **consectetur adipiscing** elit.

  

# Nullam eget gravida elit

Integer iaculis elit at risus feugiat:

  

  * Egestas non quis lorem.
  * Nam id lobortis felis.
  * Sed tincidunt nulla.
At massa tempus, quis vehicula odio laoreet.

Tags: python, html, html5lib

Jason Christa
