"UnicodeEncodeError: 'ascii' codec can't encode character"

I'm trying to pass big strings of random HTML through regular expressions, and my Python 2.6 script is choking on this:

UnicodeEncodeError: 'ascii' codec can't encode character

I traced it back to a trademark superscript on the end of this word: Protection™ -- and I expect to encounter others like it in the future.

Is there a module to process non-ASCII characters? Or, what is the best way to handle/escape non-ASCII characters in Python?

Thanks! Full error:

E
======================================================================
ERROR: test_untitled (__main__.Untitled)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Python26\Test2.py", line 26, in test_untitled
    ofile.write(Whois + '\n')
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2122' in position 1005: ordinal not in range(128)

Full Script:

from selenium import selenium
import unittest, time, re, csv, logging

class Untitled(unittest.TestCase):
    def setUp(self):
        self.verificationErrors = []
        self.selenium = selenium("localhost", 4444, "*firefox", "http://www.BaseDomain.com/")
        self.selenium.start()
        self.selenium.set_timeout("90000")

    def test_untitled(self):
        sel = self.selenium
        spamReader = csv.reader(open('SubDomainList.csv', 'rb'))
        for row in spamReader:
            sel.open(row[0])
            time.sleep(10)
            Test = sel.get_text("//html/body/div/table/tbody/tr/td/form/div/table/tbody/tr[7]/td")
            Test = Test.replace(",","")
            Test = Test.replace("\n", "")
            ofile = open('TestOut.csv', 'ab')
            ofile.write(Test + '\n')
            ofile.close()

    def tearDown(self):
        self.selenium.stop()
        self.assertEqual([], self.verificationErrors)

if __name__ == "__main__":
    unittest.main()

asked Oct 31 '09 00:10

KenBurnsFan1

1 Answer

You're trying to convert unicode to ascii in "strict" mode:

>>> help(str.encode)
Help on method_descriptor:

encode(...)
    S.encode([encoding[,errors]]) -> object

    Encodes S using the codec registered for encoding. encoding defaults
    to the default encoding. errors may be given to set a different error
    handling scheme. Default is 'strict' meaning that encoding errors raise
    a UnicodeEncodeError. Other possible values are 'ignore', 'replace' and
    'xmlcharrefreplace' as well as any other name registered with
    codecs.register_error that is able to handle UnicodeEncodeErrors.

You probably want something like one of the following:

s = u'Protection™'

print s.encode('ascii', 'ignore')             # removes the ™
print s.encode('ascii', 'replace')            # replaces with ?
print s.encode('ascii', 'xmlcharrefreplace')  # turn into xml entities
print s.encode('ascii', 'strict')             # throw UnicodeEncodeErrors
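If you would rather keep the ™ than drop or replace it, another option is to write the file as UTF-8 instead of ASCII. The following is only a sketch under a few assumptions: it uses `io.open` (available from Python 2.6 onward) rather than the plain `open` in the question, and it writes to a throwaway temp directory, so the `path` variable is hypothetical and not part of the original script.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# The string from the question, with the trademark sign written as an escape.
text = u'Protection\u2122'

# Encoding explicitly as UTF-8 succeeds where the implicit ASCII encode fails.
encoded = text.encode('utf-8')

# Hypothetical output location, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), 'TestOut.csv')

# io.open accepts an encoding and takes unicode text directly,
# so no implicit conversion to ASCII happens on write.
with io.open(path, 'a', encoding='utf-8') as ofile:
    ofile.write(text + u'\n')

# Reading back with the same encoding returns the original text.
with io.open(path, 'r', encoding='utf-8') as ifile:
    round_trip = ifile.read()
```

Whether this suits you depends on what reads the CSV afterwards: the consumer has to expect UTF-8. If it can only handle ASCII, one of the `errors` modes above is the safer choice.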

answered Sep 24 '22 05:09

Seth

Tags:

regex

unicode

python-2.6

non-ascii-characters
