
"UnicodeEncodeError: 'ascii' codec can't encode character"

I'm trying to pass big strings of arbitrary HTML through regular expressions, and my Python 2.6 script is choking on this:

UnicodeEncodeError: 'ascii' codec can't encode character

I traced it back to a trademark superscript on the end of this word: Protection™ -- and I expect to encounter others like it in the future.

Is there a module for processing non-ASCII characters? Or what is the best way to handle or escape non-ASCII text in Python?

Thanks! Full error:

E
======================================================================
ERROR: test_untitled (__main__.Untitled)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Python26\Test2.py", line 26, in test_untitled
    ofile.write(Whois + '\n')
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2122' in position 1005: ordinal not in range(128)

Full Script:

from selenium import selenium
import unittest, time, re, csv, logging

class Untitled(unittest.TestCase):
    def setUp(self):
        self.verificationErrors = []
        self.selenium = selenium("localhost", 4444, "*firefox", "http://www.BaseDomain.com/")
        self.selenium.start()
        self.selenium.set_timeout("90000")

    def test_untitled(self):
        sel = self.selenium
        spamReader = csv.reader(open('SubDomainList.csv', 'rb'))
        for row in spamReader:
            sel.open(row[0])
            time.sleep(10)
            Test = sel.get_text("//html/body/div/table/tbody/tr/td/form/div/table/tbody/tr[7]/td")
            Test = Test.replace(",","")
            Test = Test.replace("\n", "")
            ofile = open('TestOut.csv', 'ab')
            ofile.write(Test + '\n')
            ofile.close()

    def tearDown(self):
        self.selenium.stop()
        self.assertEqual([], self.verificationErrors)

if __name__ == "__main__":
    unittest.main()
KenBurnsFan1 asked Oct 31 '09 00:10





1 Answer

You're trying to convert unicode to ascii in "strict" mode:

>>> help(str.encode)
Help on method_descriptor:

encode(...)
    S.encode([encoding[,errors]]) -> object

    Encodes S using the codec registered for encoding. encoding defaults
    to the default encoding. errors may be given to set a different error
    handling scheme. Default is 'strict' meaning that encoding errors raise
    a UnicodeEncodeError. Other possible values are 'ignore', 'replace' and
    'xmlcharrefreplace' as well as any other name registered with
    codecs.register_error that is able to handle UnicodeEncodeErrors.
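To see the 'strict' default in action, here is a minimal sketch that reproduces the failure on a short string and inspects the exception. The sample string is taken from the question; the variable names are only illustrative. The exception carries the offending text and its position, which can help when hunting for the character in a long page.

```python
# -*- coding: utf-8 -*-
# Minimal reproduction: encoding to ASCII with the default 'strict' handler.
text = u'Protection\u2122'  # same as u'Protection™'

try:
    text.encode('ascii')  # errors defaults to 'strict'
    failed = False
except UnicodeEncodeError as exc:
    failed = True
    # The exception records which slice of the input could not be encoded.
    bad_char = exc.object[exc.start:exc.end]
    position = exc.start
```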

You probably want something like one of the following:

# -*- coding: utf-8 -*-
s = u'Protection™'

print s.encode('ascii', 'ignore')             # drops the ™: Protection
print s.encode('ascii', 'replace')            # replaces it with ?: Protection?
print s.encode('ascii', 'xmlcharrefreplace')  # XML character reference: Protection&#8482;
print s.encode('ascii', 'strict')             # raises UnicodeEncodeError
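If the goal is to keep the ™ in the output file rather than drop or replace it, another option is to encode to UTF-8 instead of ASCII when writing. The following is a sketch under that assumption, not part of the original answer: `append_line` is a hypothetical helper, and the temporary path stands in for the question's TestOut.csv. It relies on `io.open`, which is available from Python 2.6 onward and accepts an `encoding` argument.

```python
import io
import os
import tempfile


def append_line(path, text):
    """Append one line of unicode text to path, encoded as UTF-8.

    Hypothetical helper for illustration only.
    """
    # newline='' disables newline translation, similar to the 'ab' mode
    # used in the question's script.
    with io.open(path, 'a', encoding='utf-8', newline='') as ofile:
        ofile.write(text + u'\n')


# Usage: write the problematic string, then read the raw bytes back.
out_path = os.path.join(tempfile.mkdtemp(), 'TestOut.csv')
append_line(out_path, u'Protection\u2122')

with io.open(out_path, 'rb') as f:
    raw = f.read()
```

Anything that later reads the file has to decode it as UTF-8 as well, otherwise the three bytes of the ™ will show up as mojibake.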
Seth answered Sep 24 '22 05:09
