Python: efficient method to replace accents (é to e), remove [^a-zA-Z\d\s], and lower() [duplicate]

Tags: python, regex

Using Python 3.3. I want to do the following:

replace special alphabetical characters such as e acute (é) and o circumflex (ô) with the base character (ô to o, for example)
remove all characters except alphanumeric and spaces in between alphanumeric characters
convert to lowercase

This is what I have so far:

import re
mystring_modified = mystring.replace('\u00E9', 'e').replace('\u00F4', 'o').lower()
# Remove everything except a-z, A-Z, digits, and whitespace
alphnumspace = re.compile(r"[^a-zA-Z\d\s]")
mystring_modified = alphnumspace.sub('', mystring_modified)

How can I improve this? Efficiency is a big concern, especially since I am currently performing the operations inside a loop:

# Pseudocode: clean() is a placeholder for the operations described above
mylist = []
for mystring in myfile:
    mystring_modified = clean(mystring)
    mylist.append(mystring_modified)

The files in question are about 200,000 characters each.


asked Mar 07 '13 02:03

oyra

1 Answer

>>> import unicodedata
>>> s='éô'
>>> ''.join((c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn'))
'eo'
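Combined with the regex and lower() from the question, the whole cleanup might look like the sketch below. The function name normalize_text and the sample string are made up for illustration. The pattern is compiled once at module level so it is not rebuilt on every loop iteration. Note that, like the original pattern, it keeps all whitespace, not only spaces between alphanumeric characters.

```python
import re
import unicodedata

# Compiled once, outside any loop; same character class as in the question.
_NOT_ALNUM_SPACE = re.compile(r"[^a-zA-Z\d\s]")


def normalize_text(text):
    """Hypothetical helper: strip accents, drop other characters, lowercase."""
    # NFD splits 'é' into 'e' plus a combining acute accent (category 'Mn').
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks, keeping only the base characters.
    stripped = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    # Remove everything that is not a letter a-z/A-Z, a digit, or whitespace.
    return _NOT_ALNUM_SPACE.sub("", stripped).lower()


print(normalize_text("Café Ôtel, 2013!"))  # cafe otel 2013
```

For a file of about 200,000 characters, calling such a function once on the whole contents rather than once per line is likely cheaper, though that is worth measuring on real data.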

Also check out unidecode. From its documentation:

What Unidecode provides is a middle road: function unidecode() takes Unicode data and tries to represent it in ASCII characters (i.e., the universally displayable characters between 0x00 and 0x7F), where the compromises taken when mapping between two character sets are chosen to be near what a human with a US keyboard would choose.

The quality of resulting ASCII representation varies. For languages of western origin it should be between perfect and good. On the other hand transliteration (i.e., conveying, in Roman letters, the pronunciation expressed by the text in some other writing system) of languages like Chinese, Japanese or Korean is a very complex issue and this library does not even attempt to address it. It draws the line at context-free character-by-character mapping. So a good rule of thumb is that the further the script you are transliterating is from Latin alphabet, the worse the transliteration will be.

Note that this module generally produces better results than simply stripping accents from characters (which can be done in Python with built-in functions). It is based on hand-tuned character mappings that for example also contain ASCII approximations for symbols and non-Latin alphabets.
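To see where plain accent stripping falls short, here is a small standard-library sketch. The helper name strip_accents and the sample strings are chosen only for illustration. Characters such as Æ, ø and ß have no canonical decomposition, so the NFD approach leaves them untouched, and a later ASCII-only regex would then delete them outright. A hand-tuned mapping table like the one Unidecode describes is meant to cover such cases.

```python
import unicodedata


def strip_accents(text):
    """Hypothetical helper: remove combining marks after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")


# Accented Latin letters decompose into base letter + combining mark.
print(strip_accents("crème brûlée"))  # creme brulee

# Æ, ø and ß are single code points with no decomposition: nothing changes.
print(strip_accents("Ærøskøbing straße"))  # Ærøskøbing straße
```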


answered Nov 15 '22 13:11

John La Rooy
