Printing HTML entities using lxml in Python

I'm trying to build a div element from the string below, which contains HTML entities. The reserved & character in each entity is escaped as &amp; in the output, so the entities are displayed as plain text instead of being rendered. How can I avoid this so that the HTML entities are rendered properly?

from lxml import etree
import lxml.html

s = 'Actress Adamari L&#243;pez And Amgen Launch Spanish-Language Chemotherapy: Myths Or Facts&#8482; Website And Resources'

div = etree.Element("div")
div.text = s

lxml.html.tostring(div)

output:
<div>Actress Adamari L&amp;#243;pez And Amgen Launch Spanish-Language Chemotherapy: Myths Or Facts&amp;#8482; Website And Resources</div>

ravi asked Sep 30 '22 06:09

1 Answer

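Assigning the string to div.text treats it as literal text, so every & is escaped. Instead, parse the string with fromstring() so the entities are decoded, and specify the encoding when calling tostring() so they are written out as literal characters: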

>>> from lxml.html import fromstring, tostring
>>> s = 'Actress Adamari L&#243;pez And Amgen Launch Spanish-Language Chemotherapy: Myths Or Facts&#8482; Website And Resources'
>>> div = fromstring(s)
>>> print(tostring(div, encoding='unicode'))
<p>Actress Adamari López And Amgen Launch Spanish-Language Chemotherapy: Myths Or Facts™ Website And Resources</p>
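
If the element has to be a div, as in the question, rather than the <p> that fromstring() produces here, one possible variation is to decode the entities before assigning to div.text. The sketch below is not part of the original answer: it assumes Python 3 and uses html.unescape from the standard library, and the variable names as_text and as_bytes are only illustrative.

```python
import html

from lxml import etree
from lxml.html import tostring

s = ('Actress Adamari L&#243;pez And Amgen Launch Spanish-Language '
     'Chemotherapy: Myths Or Facts&#8482; Website And Resources')

div = etree.Element("div")
# Decode the character references first, so .text holds the real characters
# and no literal "&" is left for the serializer to escape.
div.text = html.unescape(s)

# encoding='unicode' returns a str containing the literal characters.
as_text = tostring(div, encoding='unicode')
# Without an encoding, tostring() returns ASCII bytes and writes non-ASCII
# characters as numeric character references, which browsers also render.
as_bytes = tostring(div)

print(as_text)
print(as_bytes)
```

Either form should display correctly in a browser; which one to prefer mostly depends on whether the surrounding code works with str or with bytes.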

As a side note, you should definitely use lxml.html.tostring() when working with HTML data:

Note that you should use lxml.html.tostring and not lxml.tostring. lxml.tostring(doc) will return the XML representation of the document, which is not valid HTML. In particular, things like <script src="..."></script> will be serialized as <script src="..." />, which completely confuses browsers.
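
The difference described in that note can be seen with a small sketch. It is only an illustration: the script element is built by hand, "app.js" is a placeholder file name, and the XML serializer is reached as lxml.etree.tostring.

```python
from lxml import etree
import lxml.html

# Build an empty <script> element; "app.js" is a placeholder.
script = etree.Element("script", src="app.js")

# XML serialization: an empty element is written in self-closing form.
xml_out = etree.tostring(script)
# HTML serialization: the element gets an explicit closing tag.
html_out = lxml.html.tostring(script)

print(xml_out)
print(html_out)
```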

Also see:

  • Serialising to Unicode strings

alecxe answered Oct 03 '22 01:10