Does anyone know an easy way in Python to convert a string containing HTML entity codes (e.g. &lt; and &amp;) to a normal string (e.g. < and &)? cgi.escape() will escape strings (poorly), but there is no unescape().


HTML Entity Codes to Text [duplicate]

Q: What is &gt; in HTML?

&gt; and &lt; are character entity references for the > and < characters in HTML. You cannot safely write a literal less-than (<) or greater-than (>) sign in page text, because the browser may confuse it with the start or end of a tag. To avoid this, use an entity name (&gt;) or an entity number (&#62;).
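As a small illustration of entity names and entity numbers, the sketch below shows that the named, decimal, and hexadecimal forms decode to the same character. It assumes Python 3.4 or later, where html.escape() and html.unescape() are part of the standard library; the sample strings are made up for the example.

```python
import html

# The named, decimal, and hexadecimal forms all refer to the same character.
named = html.unescape("&gt;")
decimal = html.unescape("&#62;")
hexadecimal = html.unescape("&#x3E;")

# html.escape() goes the other way, so a literal < or & can be embedded in markup.
escaped = html.escape("a < b & c")

print(named, decimal, hexadecimal)
print(escaped)
```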

2 Answers

HTMLParser has the functionality in the standard library. It is, unfortunately, undocumented:

(Python 2 Docs)

>>> import HTMLParser
>>> h = HTMLParser.HTMLParser()
>>> h.unescape('alpha &lt; &beta;')
u'alpha < \u03b2'

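(Python 3 Docs)

>>> import html
>>> html.unescape('alpha &lt; &beta;')
'alpha < β'

On Python 3.4 and later, html.unescape() is the documented function for this. The HTMLParser.unescape() method was deprecated in 3.4 and removed in 3.9.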

htmlentitydefs (renamed html.entities in Python 3) is documented, but requires you to do a lot of the work yourself.
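To give a sense of the work involved, here is a minimal sketch of decoding entities by hand with the entity table. It assumes Python 3, where the table lives in html.entities as name2codepoint; the function name unescape_entities and the regular expression are illustrative choices, not part of any library. Unknown or malformed references are left untouched.

```python
import re
from html.entities import name2codepoint

# Matches &name;, &#123; and &#x7B; style references.
_ENTITY_RE = re.compile(r"&(#[xX]?[0-9a-fA-F]+|\w+);")


def unescape_entities(text):
    """Replace named and numeric character references in text (hypothetical helper)."""

    def replace(match):
        body = match.group(1)
        try:
            if body.startswith(("#x", "#X")):
                return chr(int(body[2:], 16))   # hexadecimal reference
            if body.startswith("#"):
                return chr(int(body[1:]))       # decimal reference
            return chr(name2codepoint[body])    # named entity
        except (KeyError, ValueError, OverflowError):
            return match.group(0)               # leave unrecognized input as-is

    return _ENTITY_RE.sub(replace, text)


print(unescape_entities("alpha &lt; &beta;"))
```

Even this short version has to handle three reference syntaxes and decide what to do with invalid input, which is why the ready-made unescape function is usually the better choice.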

If you only need the XML predefined entities (lt, gt, amp, quot, apos), you could use minidom to parse them. If you only need the predefined entities and no numeric character references, you could even just use a plain old string replace for speed.
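Both suggestions can be sketched roughly as follows, assuming Python 3. The function names and the dummy <x> wrapper element are made up for illustration. The minidom version only works when the input is otherwise well-formed XML (no stray < characters and no HTML-only entities such as &beta;), and the replace version handles only the five predefined names.

```python
from xml.dom import minidom


def unescape_with_minidom(text):
    """Let the XML parser decode the predefined entities (hypothetical helper)."""
    # Wrap the fragment in a dummy root element so it forms a complete document.
    doc = minidom.parseString("<x>" + text + "</x>")
    root = doc.documentElement
    return "".join(
        node.data for node in root.childNodes if node.nodeType == node.TEXT_NODE
    )


def unescape_with_replace(text):
    """Decode the five predefined entities with plain string replacement."""
    # &amp; must come last, otherwise "&amp;lt;" would be decoded twice.
    for name, char in (("lt", "<"), ("gt", ">"), ("quot", '"'), ("apos", "'"), ("amp", "&")):
        text = text.replace("&" + name + ";", char)
    return text


print(unescape_with_minidom("a &lt; b &amp; c"))
print(unescape_with_replace("a &lt; b &amp; c"))
```

The ordering in the replace loop matters: decoding &amp; before the others would turn an escaped "&amp;lt;" into "<" rather than the intended "&lt;".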

answered Sep 27 '22 19:09

bobince

I forgot to tag it at first, but I'm using BeautifulSoup.

Digging around in the documentation, I found:

soup = BeautifulSoup(html, convertEntities=BeautifulSoup.HTML_ENTITIES)

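which does exactly what I was hoping for. (Note that convertEntities is a BeautifulSoup 3 argument; BeautifulSoup 4 dropped it and always converts entities to Unicode characters when parsing.)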

answered Sep 27 '22 20:09

tghw


Tags: python, html, beautifulsoup
