I have a list of HTML pages which may contain certain encoded characters. Some examples are below:
<a href="mailto:lad%20at%20maestro%20dot%20com">
<em>ada@graphics.maestro.com</em>
<em>mel@graphics.maestro.com</em>
I would like to decode (escape, I'm unsure of the correct terminology) these strings to:
<a href="mailto:lad at maestro dot com">
<em>ada@graphics.maestro.com</em>
<em>mel@graphics.maestro.com</em>
Note that the HTML pages are in string format. Also, I DO NOT want to use any external library such as BeautifulSoup or lxml; only native Python libraries are OK.
Edit -
The solution below isn't perfect. HTMLParser unescaping with urllib2 throws the following error in some cases:
UnicodeDecodeError: 'ascii' codec can't decode byte 0x94 in position 31: ordinal not in range(128)
You need to unescape the HTML entities and URL-unquote the string. The standard library has HTMLParser and urllib2 to help with those tasks.
import HTMLParser, urllib2
markup = '''<a href="mailto:lad%20at%20maestro%20dot%20com">
<em>ada@graphics.maestro.com</em>
<em>mel@graphics.maestro.com</em>'''
result = HTMLParser.HTMLParser().unescape(urllib2.unquote(markup))
for line in result.split("\n"):
    print(line)
Result:
<a href="mailto:lad at maestro dot com">
<em>ada@graphics.maestro.com</em>
<em>mel@graphics.maestro.com</em>
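HTMLParser and urllib2 are Python 2 module names. On Python 3 the same two steps are available as html.unescape and urllib.parse.unquote. The following is a minimal sketch, assuming the e-mail addresses in the input are written as numeric character references; the exact entity form used in the original pages is not known, so that input is an illustration only:

```python
import html
from urllib.parse import unquote

# Assumed input: the e-mail addresses are written as numeric character
# references (a hypothetical encoding chosen for illustration).
markup = '''<a href="mailto:lad%20at%20maestro%20dot%20com">
<em>&#97;&#100;&#97;&#64;graphics.maestro.com</em>
<em>&#109;&#101;&#108;&#64;graphics.maestro.com</em>'''

# URL-unquote first (%20 -> space), then unescape the HTML entities.
result = html.unescape(unquote(markup))
print(result)
```

Unquoting the whole document is a blunt tool: any literal % followed by two hex digits elsewhere in the page is rewritten as well, so for real pages it may be safer to unquote only the attribute values that are known to be URLs.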
Edit:
If your pages can contain non-ASCII characters, you'll need to take care to decode on input and encode on output.
The sample file you uploaded has charset set to cp-1252, so let's try decoding from that to Unicode:
import codecs
# Decode the cp1252 bytes to Unicode while reading.
with codecs.open(filename, encoding="cp1252") as fin:
    decoded = fin.read()
result = HTMLParser.HTMLParser().unescape(urllib2.unquote(decoded))
# Encode back to cp1252 while writing.
with codecs.open('/output/file.html', 'w', encoding='cp1252') as fou:
    fou.write(result)
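On Python 3 the built-in open() takes an encoding argument, so codecs.open is not needed. A rough sketch of the same decode-on-input, encode-on-output idea follows; the function name decode_html_file and its parameters are made up for this example, and the cp1252 default is simply the charset mentioned above:

```python
import html
from urllib.parse import unquote

def decode_html_file(src_path, dst_path, encoding="cp1252"):
    """Read src_path, URL-unquote and unescape it, then write dst_path.

    Hypothetical helper for illustration; not part of any library.
    """
    # Decode bytes to text on input, using the page's declared charset.
    with open(src_path, encoding=encoding) as fin:
        decoded = fin.read()
    result = html.unescape(unquote(decoded))
    # Encode on output. Characters the target charset cannot represent
    # are written back as numeric character references instead of failing.
    with open(dst_path, "w", encoding=encoding,
              errors="xmlcharrefreplace") as fout:
        fout.write(result)
    return result
```

The errors="xmlcharrefreplace" choice matters because unescaping can produce characters that cp1252 cannot encode; without it, the write could raise a UnicodeEncodeError.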
Edit2:
If you don't care about the non-ASCII characters you can simplify a bit:
with open(filename) as fin:
    decoded = fin.read().decode('ascii','ignore')
...
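In Python 3, str has no decode method, so the equivalent is to start from bytes (or open the file with encoding="ascii", errors="ignore"). A small sketch, using a made-up byte string that contains the 0x94 byte from the error message:

```python
import html
from urllib.parse import unquote

# Hypothetical sample: 0x94 is a non-ASCII byte like the one in the traceback.
raw = b'<em>caf\x94 &amp; bar%20baz</em>'

# Drop every byte that is not valid ASCII, then decode as before.
text = raw.decode("ascii", "ignore")
result = html.unescape(unquote(text))
print(result)
```

This silently discards data, so it only fits cases where the non-ASCII characters really do not matter.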