Encode Decode of strings python

I have a list of HTML pages which may contain certain encoded characters. Some examples are below:

<a href="mailto:lad%20at%20maestro%20dot%20com">
<em>ada&#x40;graphics.maestro.com</em>
<em>mel&#x40;graphics.maestro.com</em>

I would like to decode (or unescape; I'm unsure of the correct terminology) these strings to:

<a href="mailto:lad at maestro dot com">
<em>ada@graphics.maestro.com</em>
<em>mel@graphics.maestro.com</em>

Note that the HTML pages are in string format. Also, I do NOT want to use any external library like BeautifulSoup or lxml; only native Python libraries are OK.

Edit:

The solution below isn't perfect. HTMLParser unescaping with urllib2 throws a

UnicodeDecodeError: 'ascii' codec can't decode byte 0x94 in position 31: ordinal not in range(128)

error in some cases.

You need to do two things: unescape HTML entities and URL-unquote percent-encoded sequences.
In Python 2, the standard library has HTMLParser and urllib2 to help with those tasks.

import HTMLParser, urllib2

markup = '''<a href="mailto:lad%20at%20maestro%20dot%20com">
<em>ada&#x40;graphics.maestro.com</em>
<em>mel&#x40;graphics.maestro.com</em>'''

result = HTMLParser.HTMLParser().unescape(urllib2.unquote(markup))
for line in result.split("\n"): 
    print(line)

Result:

<a href="mailto:lad at maestro dot com">
<em>ada@graphics.maestro.com</em>
<em>mel@graphics.maestro.com</em>
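
The snippet above is Python 2 code. If you are on Python 3 (an assumption, the question doesn't say), the same two steps are available as html.unescape and urllib.parse.unquote. A minimal sketch, assuming Python 3.4 or later; the helper name decode_markup is made up for illustration:

```python
import html
from urllib.parse import unquote

def decode_markup(markup):
    """URL-unquote percent-escapes, then unescape HTML entities."""
    # decode_markup is a hypothetical helper name, not a library function.
    return html.unescape(unquote(markup))

markup = '''<a href="mailto:lad%20at%20maestro%20dot%20com">
<em>ada&#x40;graphics.maestro.com</em>
<em>mel&#x40;graphics.maestro.com</em>'''

result = decode_markup(markup)
print(result)
```

Note that, like the original, this unquotes the whole string, so any literal %xx sequence in the page text gets decoded too, not only the ones inside URLs.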

Edit:
If your pages can contain non-ASCII characters, you'll need to take care to decode on input and encode on output.
The sample file you uploaded has charset set to cp-1252, so let's try decoding from that to Unicode:

import codecs
import HTMLParser, urllib2

# Decode cp1252 bytes to unicode on input
with codecs.open(filename, encoding="cp1252") as fin:
    decoded = fin.read()
result = HTMLParser.HTMLParser().unescape(urllib2.unquote(decoded))
# Encode back to cp1252 on output
with codecs.open('/output/file.html', 'w', encoding='cp1252') as fou:
    fou.write(result)
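
For comparison, here is roughly how the decode-on-input, encode-on-output idea might look on Python 3. This is only a sketch under assumptions: the function name decode_page and the sample bytes are invented, and it works on bytes rather than files to stay self-contained. One thing it adds that the code above does not do: after unescaping, the text can contain characters that cp1252 cannot represent, so the encode step writes those back as numeric character references instead of failing.

```python
import html
from urllib.parse import unquote

def decode_page(raw, encoding="cp1252"):
    """Decode bytes, URL-unquote, unescape entities, and re-encode."""
    # decode_page is a hypothetical helper name used for illustration.
    text = raw.decode(encoding)          # bytes -> str on input
    text = html.unescape(unquote(text))  # same two steps as before
    # Characters the target charset cannot represent are written back as
    # numeric character references instead of raising UnicodeEncodeError.
    return text.encode(encoding, errors="xmlcharrefreplace")

# Made-up sample: 0x94 is a curly closing quote in cp1252, and
# &#x2603; (a snowman) has no cp1252 representation.
raw = b'<em>\x94ada&#x40;graphics.maestro.com\x94 &#x2603;</em>'
out = decode_page(raw)
print(out)
```

With real files you would read the bytes with open(filename, 'rb') and write the result with open(..., 'wb'), passing whatever charset the page actually declares.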

Edit2:
If you don't care about the non-ASCII characters you can simplify a bit:

with open(filename) as fin:
    decoded = fin.read().decode('ascii','ignore')
...
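
On Python 3 the "drop the non-ASCII characters" shortcut could be sketched like this, again with invented sample bytes. It is lossy by design: every byte outside the ASCII range is silently discarded.

```python
import html
from urllib.parse import unquote

# Made-up sample containing two non-ASCII bytes (0x94).
raw = b'<em>\x94mel&#x40;graphics.maestro.com\x94</em>'

# errors="ignore" drops any byte that is not valid ASCII.
decoded = raw.decode("ascii", "ignore")
result = html.unescape(unquote(decoded))
print(result)
```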
