
Remove a degree symbol from a string using Python

I am using Python to read a text file of data line by line. One of the lines contains a degree symbol, and I want to alter that part of the string. My script uses line = line.replace("TEMP [°C]", "TempC"). The script executes this line, but it neither changes the string nor throws an error. Clearly something about my replace call means the script does not see 'TEMP [°C]' as present in my string.

In order to insert the degree sign in my script I had to change the encoding to UTF-8 in my IDE file settings. I have included the following text at the top of my script.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

How do I replace 'TEMP [°C]' with 'TempC'?

I am using Windows 7 and Python 2.7 with Komodo IDE 5.2

I tried running the suggested code in a Python shell in Komodo, and there the replacement worked.

# -*- coding: utf-8 -*-
line = "hello TEMP [°C]"
line = line.replace("TEMP [°C]", "TempC")
print(line)
hello TempC

Another suggestion, run in the same Python shell in Komodo, raised an error instead.

line = "TEMP [°C]"
line = line.replace(u"TEMP [°C]", "TempC")
Traceback (most recent call last):
File "<console>", line 0, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb0 in position 6: ordinal not in range(128)

None of these suggestions worked when reading my text file though.

GBG asked Jan 01 '23 05:01


1 Answer

Based on your symptoms, your Python str literals end up as their utf-8 encodings, so when you type:

"TEMP [°C]"

you actually get:

'TEMP [\xc2\xb0C]'

Your file is in some other encoding (e.g. latin-1 or cp1252), and since you're reading it via plain open, you get back undecoded str. In latin-1 and cp1252 the same text is 'TEMP [\xb0C]' (note the lack of \xc2), so str comparison doesn't consider the two strings equal and replace finds nothing to replace.
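To see the mismatch for yourself, here is a minimal sketch that builds both byte sequences from the same text. It assumes the script literal is UTF-8 and the file is cp1252, as described above; the variable names are made up for illustration, and the degree sign is written as an escape so the snippet doesn't depend on how your editor saves it.

```python
# -*- coding: utf-8 -*-
# The degree sign is U+00B0; the escape avoids any dependence on the source file's encoding.
label = u"TEMP [\u00b0C]"

utf8_bytes = label.encode("utf-8")      # what a str literal in a UTF-8 script holds on Python 2
cp1252_bytes = label.encode("cp1252")   # what the data file is assumed to contain

print(repr(utf8_bytes))
print(repr(cp1252_bytes))

# A byte-level replace searches for the UTF-8 sequence and never finds it.
line_from_file = b"hello " + cp1252_bytes
unchanged = line_from_file.replace(utf8_bytes, b"TempC")
print(unchanged == line_from_file)  # True: nothing was replaced, and no error was raised
```

That silent non-match is the behavior you are seeing: the two byte strings differ by one \xc2 byte.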

The best fix has two parts. First, replace your use of open with io.open, which is the Python 3 version of open and decodes the file with the encoding you give it, so every line comes back as unicode. Second, use unicode literals instead of str literals in an encoding Python knows nothing about. Then there is no disagreement on the correct way to represent a degree symbol (in unicode, there is one, and only one, representation):

import io

with io.open('myfile.txt', encoding='cp1252') as f:
    for line in f:
        line = line.replace(u"TEMP [°C]", u"TempC")

As you describe in your edits, your file is likely cp1252 (your editor says it's ANSI, which on a Western-locale Windows machine is just a dumb way to describe cp1252), thus the chosen encoding.
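If you want to check the approach without touching your real data, something like the following should work. It is only a sketch: the sample contents and the temporary file are hypothetical stand-ins for your file, and cp1252 is still an assumption about its encoding.

```python
import io
import os
import tempfile

# Hypothetical sample data standing in for the real file, written out as cp1252.
sample = u"TIME\tTEMP [\u00b0C]\n12:00\t21.5\n"
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with io.open(path, "w", encoding="cp1252", newline="") as f:
    f.write(sample)

# Decode while reading, so every line is unicode and the unicode pattern matches.
converted = []
with io.open(path, encoding="cp1252") as f:
    for line in f:
        converted.append(line.replace(u"TEMP [\u00b0C]", u"TempC"))

os.remove(path)
print(converted[0])
```

The header line comes back with TempC in place of the original label, and the data line is left alone.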

Note: If you're going to use unicode consistently throughout your program (a decent idea if you deal with non-ASCII data), you can make that the default:

from __future__ import unicode_literals
# All string literals are unicode literals unless prefixed with b, as on Python 3

from io import open  # open is now Python 3's open

# No need to qualify with `io.` for `open`, nor put `u` in front of Unicode text
with open('myfile.txt', encoding='cp1252') as f:
    for line in f:
        line = line.replace("TEMP [°C]", "TempC")

Really you should just move to Python 3, where this whole "unicode and str try to work together and often fail" thing was resolved by splitting the two types completely.
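For a rough idea of what that split looks like in practice, here is a Python 3 only sketch (the names are illustrative). Mixing bytes and str fails loudly instead of silently not matching, and an explicit decode makes the replace work:

```python
# Python 3 only: bytes and str never mix implicitly.
raw = b"TEMP [\xb0C]"  # undecoded cp1252 bytes, e.g. from a file opened in binary mode

try:
    raw.replace("TEMP [\u00b0C]", "TempC")  # str arguments on a bytes object
    mixed_ok = True
except TypeError:
    mixed_ok = False  # Python 3 refuses rather than guessing an encoding

text = raw.decode("cp1252")  # explicit decode to str
result = text.replace("TEMP [\u00b0C]", "TempC")
print(mixed_ok, result)  # False TempC
```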

ShadowRanger answered Jan 04 '23 02:01