
How to search and replace utf-8 special characters in Python?

I'm a Python beginner, and I have a UTF-8 problem.

I have a UTF-8 string and I would like to replace all German umlauts with ASCII replacements (in German, u-umlaut 'ü' may be rewritten as 'ue').

U-umlaut has Unicode code point 252 (U+00FC), so I tried this:

>>> str = unichr(252) + 'ber'
>>> print repr(str)
u'\xfcber'
>>> print repr(str).replace(unichr(252), 'ue')
u'\xfcber'

I expected the last string to be u'ueber'.

What I ultimately want to do is replace all u-umlauts in a file with 'ue':

import sys
import codecs      
f = codecs.open(sys.argv[1],encoding='utf-8')
for line in f: 
    print repr(line).replace(unichr(252), 'ue')

Thanks for your help! (I'm using Python 2.3.)

Frank asked Jan 13 '10 05:01


3 Answers

repr(str) returns a quoted version of str: a string that, when printed, is something you could type back into Python to recreate the original. So repr(str) literally contains the characters \xfcber (a backslash followed by x, f and c), not the character ü, and replace finds no ü in it to substitute.

You can just use str.replace(unichr(252), 'ue') to replace the ü with ue.
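
The difference can be sketched in Python 3, with some assumptions that are not part of the question: chr() stands in for Python 2's unichr(), ascii() stands in for Python 2's repr() (Python 3's own repr() no longer escapes ü), and the variable is called word so that it does not shadow the built-in str.

```python
# Python 3 sketch; the question itself uses Python 2.3.
word = chr(252) + 'ber'                  # the string 'über'

# ascii() escapes non-ASCII characters the way Python 2's repr() did,
# so the result holds a backslash, 'x', 'f', 'c' and no 'ü' at all.
quoted = ascii(word)

wrong = quoted.replace(chr(252), 'ue')   # nothing to replace: unchanged
right = word.replace(chr(252), 'ue')     # replaces the real character

print(wrong)
print(right)
```

Replacing on the string first and quoting afterwards (if quoting is needed at all) is what makes the second call succeed.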

If you need to get a quoted version of the result of that, though I don't believe you should need it, you can wrap the entire expression in repr:

repr(str.replace(unichr(252), 'ue'))
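
The same idea applied line by line, as the question's file loop intends, might look like the following Python 3 sketch. The helper name replace_u_umlaut is hypothetical, and an in-memory io.StringIO stands in for the real file so the example runs on its own.

```python
import io

def replace_u_umlaut(lines):
    """Yield each line with 'ü' (U+00FC) replaced by 'ue'."""
    for line in lines:
        # Call replace on the decoded text itself, not on its repr().
        yield line.replace('\u00fc', 'ue')

# Stand-in for a file opened with open(path, encoding='utf-8').
sample = io.StringIO('\u00fcber\nM\u00fcnchen\n')

result = list(replace_u_umlaut(sample))
for line in result:
    print(line, end='')
```

With a real file, passing the object returned by open(sys.argv[1], encoding='utf-8') in place of sample should behave the same way, since both iterate over decoded lines.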

Brian Campbell answered Oct 01 '22 18:10


I would define a dictionary of the special characters I want to map and then use the translate method. (The code below is written for Python 3.)

line = 'Ich möchte die Qualität des Produkts überprüfen, bevor ich es kaufe.'

# Map each character's code point to its ASCII replacement.
special_char_map = {ord('ä'):'ae', ord('ü'):'ue', ord('ö'):'oe', ord('ß'):'ss'}
print(line.translate(special_char_map))

You will get the following result:

Ich moechte die Qualitaet des Produkts ueberpruefen, bevor ich es kaufe.
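
The map above covers only lower-case letters. A possible extension, sketched here for Python 3, also handles the upper-case umlauts; the mappings 'Ae', 'Oe' and 'Ue' and the names umlaut_table and transliterate are assumptions for illustration, not part of the answer above.

```python
# str.maketrans accepts a dict of single characters and converts the
# keys to code points, so ord() is not needed for each entry.
umlaut_table = str.maketrans({
    'ä': 'ae', 'ö': 'oe', 'ü': 'ue',
    'Ä': 'Ae', 'Ö': 'Oe', 'Ü': 'Ue',   # assumed upper-case mappings
    'ß': 'ss',
})

def transliterate(text):
    """Return text with German umlauts and sharp s written in ASCII."""
    return text.translate(umlaut_table)

converted = transliterate('Über die Größe der Äpfel')
print(converted)
```

Because translate does all substitutions in a single pass, this avoids chaining one replace call per character.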

Amin Kiany answered Oct 01 '22 16:10


You can avoid all that source-file encoding stuff and its problems. Use the Unicode names; then it's screamingly obvious what you are doing, and the code can be read and modified anywhere.

I don't know of any language where the only accented Latin letter is lower-case-u-with-umlaut-aka-diaeresis, so I've added code to loop over a table of translations under the assumption that you'll need it.

# coding: ascii

translations = (
    (u'\N{LATIN SMALL LETTER U WITH DIAERESIS}', u'ue'),
    (u'\N{LATIN SMALL LETTER O WITH DIAERESIS}', u'oe'),
    # et cetera
    )

test = u'M\N{LATIN SMALL LETTER O WITH DIAERESIS}ller von M\N{LATIN SMALL LETTER U WITH DIAERESIS}nchen'

out = test
for from_str, to_str in translations:
    out = out.replace(from_str, to_str)
print out

Output:

Moeller von Muenchen
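
The code above is Python 2 (u'' literals and the print statement). A rough Python 3 equivalent, offered as a sketch rather than as the answer's own method, can keep the source pure ASCII by resolving the names with unicodedata.lookup and doing the substitution with translate instead of a replace loop. The names replacements_by_name and table are hypothetical.

```python
import unicodedata

# Replacement text keyed by official Unicode character name.
replacements_by_name = {
    'LATIN SMALL LETTER A WITH DIAERESIS': 'ae',
    'LATIN SMALL LETTER O WITH DIAERESIS': 'oe',
    'LATIN SMALL LETTER U WITH DIAERESIS': 'ue',
    'LATIN SMALL LETTER SHARP S': 'ss',
}

# translate() wants code points as keys, so look each name up once.
table = {
    ord(unicodedata.lookup(name)): replacement
    for name, replacement in replacements_by_name.items()
}

test = ('M\N{LATIN SMALL LETTER O WITH DIAERESIS}ller von '
        'M\N{LATIN SMALL LETTER U WITH DIAERESIS}nchen')
out = test.translate(table)
print(out)
```

unicodedata.lookup raises KeyError for an unknown name, so a typo in the table shows up immediately instead of silently matching nothing.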

John Machin answered Oct 01 '22 17:10