Unescaping HTML with special characters in Python 2.7.3 / Raspberry Pi

I'm stuck here trying to unescape HTML special characters.

The problematic text is

Rudimental &amp; Emeli Sandé

which should be converted to Rudimental & Emeli Sandé

The text is downloaded with wget (outside of Python).

To test this, save an ANSI-encoded file containing this line and read it with the following script.

import HTMLParser

trackentry = open('import.txt', 'r').readlines()
print(trackentry)
track = trackentry[0]
html_parser = HTMLParser.HTMLParser()

track = html_parser.unescape(track)

print(track)

I get this error when a line has é in it.

pi@raspberrypi ~/scripting $ python unparse.py
['Rudimental &amp; Emeli Sand\xe9\n']
Traceback (most recent call last):
  File "unparse.py", line 9, in <module>
    track = html_parser.unescape(track)
  File "/usr/lib/python2.7/HTMLParser.py", line 472, in unescape
    return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
  File "/usr/lib/python2.7/re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 11: ordinal not in range(128)

The same code works fine on Windows. I only have problems on the Raspberry Pi running Python 2.7.3.

576i asked Jan 24 '14 21:01


1 Answer

Python cannot decode 'é' ('\xe9') using the ASCII codec because this character is not 7-bit ASCII.

Your problem (condensed):

import HTMLParser
parser = HTMLParser.HTMLParser()
input = 'Rudimental &amp; Emeli Sand\xe9'
output = parser.unescape(input)

produces

Traceback (most recent call last):
  File "problem.py", line 4, in <module>
    output = parser.unescape(input)
  File "/usr/lib/python2.7/HTMLParser.py", line 475, in unescape
    return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
  File "/usr/lib/python2.7/re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 11: ordinal not in range(128)

HTMLParser.unescape() returns a unicode object, so Python 2 has to convert your input str to unicode implicitly. It does this with the default encoding (ASCII in your case) and fails on '\xe9', which is not an ASCII character. I guess your file is encoded in ISO-8859-1 (or the closely related Windows-1252, which is what Windows usually calls "ANSI"), where '\xe9' is 'é'.
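The decoding failure can be reproduced without HTMLParser. The following is a minimal sketch, not part of the original script: the names raw, ascii_ok and text are illustrative, and ISO-8859-1 is only the guessed file encoding. It shows that the ASCII codec rejects the byte 0xe9 while ISO-8859-1 accepts it.

```python
# Minimal sketch; runs on Python 2.7 and Python 3.
raw = b'Rudimental &amp; Emeli Sand\xe9'  # bytes as they come out of the file

try:
    raw.decode('ascii')            # the conversion Python 2 attempts implicitly
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False               # 0xe9 lies outside the 7-bit ASCII range

# Assumption: the file was saved as ISO-8859-1 (Latin-1).
text = raw.decode('iso-8859-1')

print(ascii_ok)
print(repr(text))
```

Once the string has been decoded explicitly like this, no implicit ASCII conversion is needed later on.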

There are two easy solutions. Either you do the conversion manually:

import HTMLParser
parser = HTMLParser.HTMLParser()
input = 'Rudimental &amp; Emeli Sand\xe9'
# Decode the byte string explicitly so unescape() receives unicode
input = input.decode('iso-8859-1')
output = parser.unescape(input)

or you use codecs.open() instead of open() whenever you are working with files:

import codecs
import HTMLParser
parser = HTMLParser.HTMLParser()
# codecs.open() decodes while reading, so readline() returns unicode
input = codecs.open("import.txt", encoding="iso-8859-1").readline()
output = parser.unescape(input)
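
The two variants can also be folded into one small helper. This is only a sketch under stated assumptions: read_and_unescape is a hypothetical name, the ISO-8859-1 default is the same guess as above, and the html.unescape branch is there in case the script is later run on Python 3, where the HTMLParser module no longer exists under that name.

```python
import io

try:
    from html import unescape             # Python 3.4+
except ImportError:
    import HTMLParser                      # Python 2
    unescape = HTMLParser.HTMLParser().unescape


def read_and_unescape(path, encoding='iso-8859-1'):
    """Return the lines of a text file with HTML entities unescaped.

    Hypothetical helper; `encoding` must match how the file was saved.
    """
    # io.open() decodes while reading, so unescape() never sees raw bytes.
    with io.open(path, encoding=encoding) as handle:
        return [unescape(line.rstrip(u'\n')) for line in handle]
```

With the file from the question, read_and_unescape('import.txt')[0] would give the first track name as a unicode string.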

Yurim answered Nov 15 '22 09:11