Replacing a weird single-quote (’) with blank string in Python

Tags: python
I'm trying to use string.replace('’','') to replace the dreaded weird single-quote character: ’ (aka \xe2, aka &#8217;). But when I run that line of code, I get this error:

SyntaxError: Non-ASCII character '\xe2' in file

EDIT: I get this error when trying to replace characters in a CSV file obtained remotely.

# encoding: utf-8

import urllib2

# read raw CSV data from URL
url = urllib2.urlopen('http://www.aaphoenix.org/meetings/aa_meetings.csv')
raw = url.read()

# replace bad characters
raw = raw.replace('’', "")

print(raw)

Even after the above code is executed, the unwanted character still appears in the printed result. I tried the suggestions in the answers below as well. I'm pretty sure it's an encoding issue, but I don't know how to fix it, so any help is much appreciated.

Gady asked Sep 13 '11 01:09


4 Answers

The problem here is with the encoding of the file you downloaded (aa_meetings.csv). The server doesn't declare an encoding in its HTTP headers, but the only non-ASCII[1] octet in the file has the value 0x92. You say that this is supposed to be "the dreaded weird single-quote character", therefore the file's encoding is windows-1252. But you're trying to search and replace for the UTF-8 encoding of U+2019, i.e. '\xe2\x80\x99', which is not what is in the file.

Fixing this is as simple as adding appropriate calls to encode and decode:

# encoding: utf-8
import urllib2

# read raw CSV data from URL
url = urllib2.urlopen('http://www.aaphoenix.org/meetings/aa_meetings.csv')
raw = url.read().decode('windows-1252')

# replace bad characters
raw = raw.replace(u'’', u"'")

print(raw.encode("ascii"))

[1] By "ASCII" I mean "the character encoding which maps single octets with values 0x00 through 0x7F directly to U+0000 through U+007F, and does not define the meaning of octets with values 0x80 through 0xFF".
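To make the mismatch concrete, here is a minimal sketch in Python 3 syntax (the answer's own code is Python 2). The sample bytes are invented for illustration and are not taken from the real CSV; only the encodings and code points come from the explanation above.

```python
# Hypothetical sample: a windows-1252 byte string containing the octet 0x92.
raw = b"Don\x92t Drink Group,Monday"

# The same character, U+2019, is spelled differently in each encoding.
assert "\u2019".encode("utf-8") == b"\xe2\x80\x99"
assert "\u2019".encode("windows-1252") == b"\x92"

# Searching the raw bytes for the UTF-8 spelling matches nothing,
# so the replace silently leaves the data unchanged.
unchanged = raw.replace(b"\xe2\x80\x99", b"")

# Decoding first turns 0x92 into U+2019, which can then be replaced.
text = raw.decode("windows-1252")
fixed = text.replace("\u2019", "'")

print(fixed)  # Don't Drink Group,Monday
```

The replace only works once the search string and the data are in the same representation, which is why decoding comes first.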

zwol answered Oct 20 '22 09:10


This file is encoded in Windows-1252. The apostrophe U+2019 encodes to \x92 in this encoding. The proper thing is to decode the file to Unicode for processing:

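# Python 2: read the raw bytes, then decode them before processing
data = open('aa_meetings.csv', 'rb').read()
assert '\x92' in data               # the raw data contains the byte 0x92
chars = data.decode('cp1252')       # byte string -> unicode
assert u'\u2019' in chars           # 0x92 has become U+2019
fixed = chars.replace(u'\u2019', u'')
assert u'\u2019' not in fixed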

The problem was you were searching for a UTF-8 encoded U+2019, i.e. \xe2\x80\x99, which was not in the file. Converting to Unicode solves this.
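For readers on Python 3, the same decode-then-replace idea can be sketched by letting open() do the decoding. This is an illustration under assumptions, not part of the original answer: the temporary file and its contents are hypothetical stand-ins for aa_meetings.csv.

```python
import os
import tempfile

# Write a small stand-in file containing the windows-1252 byte 0x92.
with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as tmp:
    tmp.write(b"name,day\nBeginner\x92s Meeting,Monday\n")
    path = tmp.name

try:
    # Passing the encoding makes read() return text (str), not bytes.
    with open(path, encoding="cp1252", newline="") as f:
        chars = f.read()
finally:
    os.remove(path)

# The text now contains U+2019, so a plain str.replace removes it.
fixed = chars.replace("\u2019", "")
```

Opening the file without an explicit encoding would fall back to a platform-dependent default, so naming cp1252 keeps the result predictable.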

Using unicode escapes as I have here is an easy way to avoid this mistake. However, you can also type the character directly, as long as you write it in a unicode literal, u'’'. A plain '’' gives you the encoded bytes instead:

Python 2.7.1
>>> u'’'
u'\u2019'
>>> '’'
'\xe2\x80\x99'

Josh Lee answered Oct 20 '22 10:10


You have to declare the encoding of your source file. Put this as one of the first two lines of your code:

# encoding: utf-8

If you are using an encoding other than UTF-8 (for example Latin-1), you have to put that instead.
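As a rough illustration of what the declaration changes, the sketch below (Python 3 syntax) compiles small source snippets supplied as bytes, so the interpreter has to rely on the declared encoding to decode the string literal. The helper name literal_from_source and the snippets are hypothetical, made up for this example.

```python
def literal_from_source(source_bytes):
    """Run source code given as bytes and return the value it binds to s."""
    namespace = {}
    exec(compile(source_bytes, "<example>", "exec"), namespace)
    return namespace["s"]

# The same character saved in two different file encodings,
# each with a matching declaration on the first line.
utf8_source = b"# encoding: utf-8\ns = '\xe2\x80\x99'\n"
cp1252_source = b"# encoding: cp1252\ns = '\x92'\n"

# A declaration that does not match how the file was saved:
# the byte 0x92 is read as Latin-1, giving the control character U+0092.
mismatched_source = b"# encoding: latin-1\ns = '\x92'\n"

assert literal_from_source(utf8_source) == "\u2019"
assert literal_from_source(cp1252_source) == "\u2019"
assert literal_from_source(mismatched_source) == "\x92"
```

Note that the declaration only governs how the source file itself is read; it says nothing about the encoding of data downloaded at run time, which still has to be decoded explicitly.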

Roberto Bonvallet answered Oct 20 '22 09:10


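You can do string.replace('\xe2\x80\x99', "'") to replace them with the normal single-quote when the data is UTF-8 (the character is three bytes there, not just '\xe2'). If the data is Windows-1252, as with this file, the byte to replace is '\x92'.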

Ethan Furman answered Oct 20 '22 09:10