
Replace special characters with ASCII equivalent

Tags: python, unicode

Is there a library that can replace special characters with their ASCII equivalents, like:

"Cześć" 

to:

"Czesc" 

I can of course create a map:

{'ś':'s', 'ć': 'c'} 

and use some replace function, but I don't want to hardcode all the equivalents into my program if there is a function that already does that.

Tomasz Wysocki asked Jul 07 '10 12:07


2 Answers

    #!/usr/bin/env python3
    import unicodedata

    text = 'Cześć'
    # Decompose accented letters (NFD), then drop the non-ASCII combining marks.
    print(unicodedata.normalize('NFD', text).encode('ascii', 'ignore').decode('ascii'))

nosklo answered Sep 22 '22 21:09


You can get most of the way by doing:

    import unicodedata

    def strip_accents(text):
        return ''.join(c for c in unicodedata.normalize('NFKD', text)
                       if unicodedata.category(c) != 'Mn')

Unfortunately, some Latin letters cannot be decomposed into an ASCII letter plus combining marks, so you'll have to handle them manually. These include:

  • Æ → AE
  • Ð → D
  • Ø → O
  • Þ → TH
  • ß → ss
  • æ → ae
  • ð → d
  • ø → o
  • þ → th
  • Œ → OE
  • œ → oe
  • ƒ → f
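
One way to combine the two steps, as a minimal sketch rather than a definitive implementation: strip the combining marks first, then apply a translation table built from the list above. The names MANUAL_MAP and to_ascii are hypothetical, chosen only for this illustration, and the table covers just the letters listed here.

```python
import unicodedata

# Manual replacements for letters that have no Unicode decomposition.
# Hypothetical table built from the list above; extend it as needed.
MANUAL_MAP = str.maketrans({
    'Æ': 'AE', 'Ð': 'D', 'Ø': 'O', 'Þ': 'TH', 'ß': 'ss',
    'æ': 'ae', 'ð': 'd', 'ø': 'o', 'þ': 'th',
    'Œ': 'OE', 'œ': 'oe', 'ƒ': 'f',
})

def strip_accents(text):
    # Decompose characters (NFKD) and drop the combining marks (category Mn).
    return ''.join(c for c in unicodedata.normalize('NFKD', text)
                   if unicodedata.category(c) != 'Mn')

def to_ascii(text):
    # Hypothetical helper: remove accents, then map the leftover letters by hand.
    return strip_accents(text).translate(MANUAL_MAP)

print(to_ascii('Cześć'))   # Czesc
print(to_ascii('Straße'))  # Strasse
```

Characters that are neither decomposable nor in the table (for example, non-Latin scripts) pass through unchanged, so the result is not guaranteed to be pure ASCII.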
dan04 answered Sep 20 '22 21:09