I'm trying to use string.replace('’','')
to replace the dreaded weird single-quote character: ’ (aka \xe2 aka #8217). But when I run that line of code, I get this error:
SyntaxError: Non-ASCII character '\xe2' in file
EDIT: I get this error when trying to replace characters in a CSV file obtained remotely.
# encoding: utf-8
import urllib2
# read raw CSV data from URL
url = urllib2.urlopen('http://www.aaphoenix.org/meetings/aa_meetings.csv')
raw = url.read()
# replace bad characters
raw = raw.replace('’', "")
print(raw)
Even after the above code is executed, the unwanted character still exists in the print result. I tried the suggestions in the below answers as well. Pretty sure it's an encoding issue, but I just don't know how to fix it, so of course any help is much appreciated.
Call str.replace(old, new) with old as "'" and new as "" to remove all single quotes from the string. Note that this matches only the ASCII apostrophe, not the typographic ’ (U+2019).
To remove quotes (“”) from a Python string, use replace(), or strip them if they appear only at the ends of the string.
Use str.strip(chars) with the quote character '"' as chars to remove quotes from both ends of the string.
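A small sketch of the two calls described above. The sample strings are made up for illustration, and the snippet is written for Python 3 rather than the Python 2 used elsewhere in this thread.

```python
# Hypothetical sample strings, chosen only to illustrate the calls.
text = "it's a 'quoted' word"
no_quotes = text.replace("'", "")   # removes every ASCII apostrophe

wrapped = '"say "hi" now"'
trimmed = wrapped.strip('"')        # removes quotes at both ends only

curly = "it\u2019s"                 # typographic apostrophe, U+2019
untouched = curly.replace("'", "")  # no ASCII apostrophe, so nothing changes

print(no_quotes)
print(trimmed)
```

The last line shows why these calls alone do not help with the question above: replace("'", "") never matches U+2019.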
The problem here is with the encoding of the file you downloaded (aa_meetings.csv). The server doesn't declare an encoding in its HTTP headers, but the only non-ASCII [1] octet in the file has the value 0x92. You say that this is supposed to be "the dreaded weird single-quote character", therefore the file's encoding is windows-1252. But you're trying to search and replace for the UTF-8 encoding of U+2019, i.e. '\xe2\x80\x99', which is not what is in the file.
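To make the mismatch concrete, here is a minimal sketch, written for Python 3 (the thread itself uses Python 2). The sample byte string is hypothetical and merely stands in for a line of the downloaded file.

```python
RIGHT_SINGLE_QUOTE = "\u2019"

# The same character becomes different bytes in different encodings.
as_cp1252 = RIGHT_SINGLE_QUOTE.encode("cp1252")   # one byte: 0x92
as_utf8 = RIGHT_SINGLE_QUOTE.encode("utf-8")      # three bytes: 0xE2 0x80 0x99

# Hypothetical sample of Windows-1252 data, standing in for the CSV contents.
raw = b"Friend\x92s Group"

found = as_utf8 in raw                  # the UTF-8 bytes are not in the data
replaced = raw.replace(as_utf8, b"")    # so the replace changes nothing
```

Because the search pattern never occurs in the data, replace() returns the input unchanged, which matches the behaviour reported in the question.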
Fixing this is as simple as adding appropriate calls to encode and decode:
# encoding: utf-8
import urllib2
# read raw CSV data from URL
url = urllib2.urlopen('http://www.aaphoenix.org/meetings/aa_meetings.csv')
raw = url.read().decode('windows-1252')
# replace bad characters
raw = raw.replace(u'’', u"'")
print(raw.encode("ascii"))
[1] By "ASCII" I mean "the character encoding which maps single octets with values 0x00 through 0x7F directly to U+0000 through U+007F, and does not define the meaning of octets with values 0x80 through 0xFF".
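For readers on Python 3, the same decode-then-replace approach might look like the sketch below. The function name clean_quotes and the sample bytes are assumptions made for illustration; they are not part of the original answer.

```python
def clean_quotes(raw_bytes, source_encoding="cp1252"):
    """Decode bytes to text, then swap U+2019 for an ASCII apostrophe.

    clean_quotes is a hypothetical helper, not code from the answer above.
    """
    text = raw_bytes.decode(source_encoding)
    return text.replace("\u2019", "'")


# Hypothetical sample standing in for the bytes returned by the download.
sample = b"Women\x92s Meeting,Monday"
cleaned = clean_quotes(sample)
print(cleaned)
```

In real use, raw_bytes would be whatever the download returns, for example the result of urllib.request.urlopen(url).read() in Python 3.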
This file is encoded in Windows-1252. The apostrophe U+2019 encodes to \x92 in this encoding. The proper thing is to decode the file to Unicode for processing:
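# Python 2: read the file as raw bytes
data = open('aa_meetings.csv', 'rb').read()
assert '\x92' in data                   # Windows-1252 byte for U+2019
chars = data.decode('cp1252')           # bytes -> unicode
assert u'\u2019' in chars
fixed = chars.replace(u'\u2019', u'')   # drop the curly apostrophe
assert u'\u2019' not in fixed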
The problem was you were searching for a UTF-8 encoded U+2019, i.e. \xe2\x80\x99, which was not in the file. Converting to Unicode solves this.
Using unicode literals as I have here is an easy way to avoid this mistake. However, you can also type the character directly, as long as you write it as a unicode literal, u'’', rather than a byte string:
Python 2.7.1
>>> u'’'
u'\u2019'
>>> '’'
'\xe2\x80\x99'
You have to declare the encoding of your source file. Put this as one of the first two lines of your code:
# encoding: utf-8
If you are using an encoding other than UTF-8 (for example Latin-1), you have to put that instead.
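You can do string.replace('\xe2\x80\x99', "'") to replace them with the normal single-quote, provided the string holds UTF-8 encoded bytes. Replacing '\xe2' alone would leave the other two bytes of the sequence behind.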
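When working directly on UTF-8 bytes, the whole three-byte sequence for U+2019 has to be matched. A minimal Python 3 sketch, using a made-up sample string, shows the difference:

```python
# Hypothetical sample; "\u2019" is the typographic apostrophe.
utf8_bytes = "Friend\u2019s".encode("utf-8")

# Replacing only the first byte leaves two stray continuation bytes behind.
partial = utf8_bytes.replace(b"\xe2", b"'")

# Replacing the full sequence gives clean ASCII.
full = utf8_bytes.replace(b"\xe2\x80\x99", b"'")

try:
    partial.decode("utf-8")
    partial_is_valid = True
except UnicodeDecodeError:
    partial_is_valid = False
```

This only applies to data that really is UTF-8; for the Windows-1252 file discussed above, the byte to look for is \x92.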