Replacing a weird single-quote (’) with blank string in Python

Tags:

python

I'm trying to use string.replace('’','') to replace the dreaded weird single-quote character: ’ (aka \xe2 aka #8217). But when I run that line of code, I get this error:

SyntaxError: Non-ASCII character '\xe2' in file

EDIT: I get this error when trying to replace characters in a CSV file obtained remotely.

# encoding: utf-8

import urllib2

# read raw CSV data from URL
url = urllib2.urlopen('http://www.aaphoenix.org/meetings/aa_meetings.csv')
raw = url.read()

# replace bad characters
raw = raw.replace('’', "")

print(raw)

Even after the above code is executed, the unwanted character still appears in the printed result. I also tried the suggestions in the answers below. I'm pretty sure it's an encoding issue, but I don't know how to fix it, so any help is much appreciated.

248

asked Sep 13 '11 01:09

Gady

4 Answers

The problem here is the encoding of the file you downloaded (aa_meetings.csv). The server doesn't declare an encoding in its HTTP headers, but the only non-ASCII¹ octet in the file has the value 0x92. You say that this is supposed to be "the dreaded weird single-quote character", so the file's encoding must be windows-1252. But you're searching for the UTF-8 encoding of U+2019, i.e. '\xe2\x80\x99', which is not what is in the file.
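To make the mismatch concrete, here is a minimal sketch in Python 3 syntax (the question's code is Python 2). The sample line is made up for illustration and is not taken from the real CSV file.

```python
# The same character, U+2019, has different byte representations
# depending on the encoding.
quote = "\u2019"

cp1252_bytes = quote.encode("cp1252")  # b'\x92', the byte found in the file
utf8_bytes = quote.encode("utf-8")     # b'\xe2\x80\x99', what a UTF-8 source literal holds

# Made-up sample line, as it would look inside a windows-1252 file.
sample = b"Women\x92s Meeting"

# Searching for the UTF-8 bytes finds nothing, so replace() changes nothing.
unchanged = sample.replace(utf8_bytes, b"")

print(cp1252_bytes, utf8_bytes, unchanged == sample)
```

This is why the replace call in the question appears to do nothing: it runs without error, but the byte sequence it looks for never occurs in the data.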

Fixing this is as simple as adding appropriate calls to encode and decode:

# encoding: utf-8
import urllib2

# read raw CSV data from URL
url = urllib2.urlopen('http://www.aaphoenix.org/meetings/aa_meetings.csv')
raw = url.read().decode('windows-1252')

# replace bad characters
raw = raw.replace(u'’', u"'")

print(raw.encode("ascii"))

¹ by "ASCII" I mean "the character encoding which maps single octets with values 0x00 through 0x7F directly to U+0000 through U+007F, and does not define the meaning of octets with values 0x80 through 0xFF".
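For readers on Python 3, the same decode-then-replace idea might look like the sketch below. The function name replace_curly_apostrophe and the sample bytes are hypothetical, introduced only for illustration; in a real script the bytes would come from the download (urllib.request.urlopen is the Python 3 counterpart of urllib2.urlopen).

```python
def replace_curly_apostrophe(raw_bytes, replacement="'", encoding="windows-1252"):
    """Decode raw bytes to text, then replace U+2019 with `replacement`."""
    text = raw_bytes.decode(encoding)
    return text.replace("\u2019", replacement)


# Made-up bytes standing in for the result of url.read().
sample = b"Women\x92s Meeting,Monday\r\n"

with_ascii_quote = replace_curly_apostrophe(sample)
with_nothing = replace_curly_apostrophe(sample, replacement="")

print(with_ascii_quote)
print(with_nothing)
```

Passing replacement="" gives the blank-string behaviour asked for in the question, while the default keeps a plain ASCII apostrophe.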

118

answered Oct 20 '22 09:10

zwol

This file is encoded in Windows-1252. The apostrophe U+2019 encodes to \x92 in this encoding. The proper thing is to decode the file to Unicode for processing:

data = open('aa_meetings.csv').read()
assert '\x92' in data
chars = data.decode('cp1252')
assert u'\u2019' in chars
fixed = chars.replace(u'\u2019', '')
assert u'\u2019' not in fixed

The problem was you were searching for a UTF-8 encoded U+2019, i.e. \xe2\x80\x99, which was not in the file. Converting to Unicode solves this.
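In Python 3 the decoding step can be handed to open() through its encoding argument. The sketch below assumes a small made-up file written to a temporary directory, not the real aa_meetings.csv.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "sample.csv")

    # Write a made-up line containing the windows-1252 byte 0x92.
    with open(path, "wb") as f:
        f.write(b"Women\x92s Meeting\n")

    # encoding= makes open() decode the bytes to str while reading.
    with open(path, encoding="cp1252") as f:
        chars = f.read()

# chars is now text, so the search is for the character itself.
fixed = chars.replace("\u2019", "")
print(repr(fixed))
```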

Using unicode literals as I have here is an easy way to avoid this mistake. However, you can encode the character directly if you write it as u'’':

Python 2.7.1
>>> u'’'
u'\u2019'
>>> '’'
'\xe2\x80\x99'

answered Oct 20 '22 10:10

Josh Lee

You have to declare the encoding of your source file. Put this as one of the first two lines of your code:

# encoding: utf-8

If you are using an encoding other than UTF-8 (for example Latin-1), you have to put that instead.

answered Oct 20 '22 09:10

Roberto Bonvallet

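You can do string.replace('\x92', "'") on the raw bytes to replace each one with a normal single quote. In this file the character is the single Windows-1252 byte 0x92; '\xe2' is only the first byte of its three-byte UTF-8 encoding.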
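A caution about byte-level replacement: when the data is UTF-8, the character occupies three bytes, and all three have to be replaced together. The sketch below uses made-up sample data and Python 3 bytes literals to show what happens otherwise.

```python
# Made-up sample text encoded as UTF-8, where U+2019 is three bytes.
utf8_data = "Women\u2019s Meeting".encode("utf-8")

# Replacing only the first byte leaves the other two behind.
partial = utf8_data.replace(b"\xe2", b"'")

# Replacing the full three-byte sequence gives clean ASCII.
full = utf8_data.replace(b"\xe2\x80\x99", b"'")

print(partial)
print(full)
```

Decoding to text first, as the other answers do, avoids having to know the byte sequence at all.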

answered Oct 20 '22 09:10

Ethan Furman
