Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Encode Decode of strings python

I have a list of html pages which may contain certain encoded characters. Some examples are as below -

<a href="mailto:lad%20at%20maestro%20dot%20com">
<em>ada&#x40;graphics.maestro.com</em>
<em>mel&#x40;graphics.maestro.com</em>

I would like to decode (escape, I'm unsure of the current terminology) these strings to -

 <a href="mailto:lad at maestro dot com">
<em>[email protected]</em>
<em>[email protected]</em>

Note, the HTML pages are in a string format. Also, I DO NOT want to use any external library like a BeautifulSoup or lxml, only native python libraries are ok.

Edit -

The below solution isn't perfect. HTML Parser unescaping with urllib2 throws a

UnicodeDecodeError: 'ascii' codec can't decode byte 0x94 in position 31: ordinal not in range(128)

error in some cases.

like image 986
Dexter Avatar asked Mar 25 '12 00:03

Dexter


People also ask

How do I decode a UTF-8 string in Python?

Use bytes.decode(encoding) with encoding as "utf8" to decode a UTF-8-encoded byte string bytes .

What is encode/decode in Python?

Practical Data Science using PythonTo represent a unicode string as a string of bytes is known as encoding. To convert a string of bytes to a unicode string is known as decoding.

How do you encode strings in Python?

Python String encode() Method The encode() method encodes the string, using the specified encoding. If no encoding is specified, UTF-8 will be used.

How do you decode a string?

Decode String - LeetCode. Given an encoded string, return its decoded string. The encoding rule is: k[encoded_string] , where the encoded_string inside the square brackets is being repeated exactly k times. Note that k is guaranteed to be a positive integer.


1 Answers

You need to unescape HTML entities, and URL-unquote.
The standard library has HTMLParser and urllib2 to help with those tasks.

import HTMLParser, urllib2

markup = '''<a href="mailto:lad%20at%20maestro%20dot%20com">
<em>ada&#x40;graphics.maestro.com</em>
<em>mel&#x40;graphics.maestro.com</em>'''

result = HTMLParser.HTMLParser().unescape(urllib2.unquote(markup))
for line in result.split("\n"): 
    print(line)

Result:

<a href="mailto:lad at maestro dot com">
<em>[email protected]</em>
<em>[email protected]</em>

Edit:
If your pages can contain non-ASCII characters, you'll need to take care to decode on input and encode on output.
The sample file you uploaded has charset set to cp-1252, so let's try decoding from that to Unicode:

import codecs 
with codecs.open(filename, encoding="cp1252") as fin:
    decoded = fin.read()
result = HTMLParser.HTMLParser().unescape(urllib2.unquote(decoded))
with codecs.open('/output/file.html', 'w', encoding='cp1252') as fou:
    fou.write(result)

Edit2:
If you don't care about the non-ASCII characters you can simplify a bit:

with open(filename) as fin:
    decoded = fin.read().decode('ascii','ignore')
...
like image 58
mechanical_meat Avatar answered Oct 18 '22 20:10

mechanical_meat