I'm trying to pass big strings of random html through regular expressions and my Python 2.6 script is choking on this:
UnicodeEncodeError: 'ascii' codec can't encode character
I traced it back to a trademark superscript on the end of this word: Protection™ -- and I expect to encounter others like it in the future.
Is there a module to process non-ascii characters? or, what is the best way to handle/escape non-ascii stuff in python?
Thanks! Full error:
E ====================================================================== ERROR: test_untitled (__main__.Untitled) ---------------------------------------------------------------------- Traceback (most recent call last): File "C:\Python26\Test2.py", line 26, in test_untitled ofile.write(Whois + '\n') UnicodeEncodeError: 'ascii' codec can't encode character u'\u2122' in position 1005: ordinal not in range(128)
Full Script:
from selenium import selenium import unittest, time, re, csv, logging class Untitled(unittest.TestCase): def setUp(self): self.verificationErrors = [] self.selenium = selenium("localhost", 4444, "*firefox", "http://www.BaseDomain.com/") self.selenium.start() self.selenium.set_timeout("90000") def test_untitled(self): sel = self.selenium spamReader = csv.reader(open('SubDomainList.csv', 'rb')) for row in spamReader: sel.open(row[0]) time.sleep(10) Test = sel.get_text("//html/body/div/table/tbody/tr/td/form/div/table/tbody/tr[7]/td") Test = Test.replace(",","") Test = Test.replace("\n", "") ofile = open('TestOut.csv', 'ab') ofile.write(Test + '\n') ofile.close() def tearDown(self): self.selenium.stop() self.assertEqual([], self.verificationErrors) if __name__ == "__main__": unittest.main()
Only a limited number of Unicode characters are mapped to strings. Thus, any character that is not-represented / mapped will cause the encoding to fail and raise UnicodeEncodeError. To avoid this error use the encode( utf-8 ) and decode( utf-8 ) functions accordingly in your code.
UTF-8 (UCS Transformation Format 8) is the World Wide Web's most common character encoding. Each character is represented by one to four bytes. UTF-8 is backward-compatible with ASCII and can represent any standard Unicode character.
You're trying to convert unicode to ascii in "strict" mode:
>>> help(str.encode) Help on method_descriptor: encode(...) S.encode([encoding[,errors]]) -> object Encodes S using the codec registered for encoding. encoding defaults to the default encoding. errors may be given to set a different error handling scheme. Default is 'strict' meaning that encoding errors raise a UnicodeEncodeError. Other possible values are 'ignore', 'replace' and 'xmlcharrefreplace' as well as any other name registered with codecs.register_error that is able to handle UnicodeEncodeErrors.
You probably want something like one of the following:
s = u'Protection™' print s.encode('ascii', 'ignore') # removes the ™ print s.encode('ascii', 'replace') # replaces with ? print s.encode('ascii','xmlcharrefreplace') # turn into xml entities print s.encode('ascii', 'strict') # throw UnicodeEncodeErrors
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With