
How do I convert a file's format from Unicode to ASCII using Python?

I use a third-party tool that outputs a file in Unicode format, but I would prefer it to be in ASCII. The tool has no setting to change the file format.

What is the best way to convert the entire file format using Python?

Ray asked Oct 06 '08 17:10



2 Answers

You can convert the file easily enough using Python 2's built-in unicode function, but you'll run into problems with any Unicode characters that have no direct ASCII equivalent.

This blog recommends the unicodedata module, which approximates characters that have no direct ASCII equivalent. For example,

>>> title = u"Klüft skräms inför på fédéral électoral große"

is typically converted to

Klft skrms infr p fdral lectoral groe

which is pretty wrong. However, using the unicodedata module, the result can be much closer to the original text:

>>> import unicodedata
>>> unicodedata.normalize('NFKD', title).encode('ascii','ignore')
'Kluft skrams infor pa federal electoral groe'
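
For the file itself, the same idea can be applied while reading and writing. The following is a minimal Python 3 sketch, not part of the original answer. The function names `to_ascii` and `convert_to_ascii` and the file names in the usage note are hypothetical, and it assumes the tool writes UTF-16 (which is what many Windows tools mean by "Unicode"); change `src_encoding` to whatever the file actually uses.

```python
import unicodedata


def to_ascii(text):
    """Decompose accented characters, then drop anything that is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def convert_to_ascii(src_path, dst_path, src_encoding="utf-16"):
    """Rewrite src_path as an ASCII-only file at dst_path, line by line."""
    # newline="" on both sides keeps the original line endings untouched.
    with open(src_path, encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="ascii", newline="") as dst:
        for line in src:
            dst.write(to_ascii(line))
```

It could be called as `convert_to_ascii("tool_output.txt", "tool_output_ascii.txt")`. Characters with no decomposition, such as ß, are still silently dropped, so the result is an approximation rather than a lossless conversion.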

ConroyP answered Oct 15 '22 01:10


I think this is a deeper issue than you realize. Simply changing the file from Unicode to ASCII is easy. Getting every Unicode character to translate into a reasonable ASCII counterpart is another matter, because most Unicode characters have no ASCII equivalent at all.
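
To make the trade-off concrete, here is a small Python 3 sketch (an illustration added here, not taken from the linked tutorial) of what the standard `str.encode` error handlers do with characters that ASCII cannot represent. The sample string is arbitrary.

```python
text = "André große"

# 'strict' is the default: it refuses to guess and raises an error.
try:
    text.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# The other handlers always succeed, each losing information differently.
ignored = text.encode("ascii", "ignore")              # drop the character
replaced = text.encode("ascii", "replace")            # substitute '?'
escaped = text.encode("ascii", "backslashreplace")    # keep a \x.. escape
xml_refs = text.encode("ascii", "xmlcharrefreplace")  # keep a &#...; reference

print(ignored)   # b'Andr groe'
print(replaced)  # b'Andr? gro?e'
print(escaped)   # b'Andr\\xe9 gro\\xdfe'
print(xml_refs)  # b'Andr&#233; gro&#223;e'
```

None of these produces "Andre grosse"; choosing a sensible ASCII stand-in for each character is a separate transliteration step.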

This Python Unicode tutorial may give you a better idea of what happens to Unicode strings that are translated to ASCII: http://www.reportlab.com/i18n/python_unicode_tutorial.html

Here's a useful quote from the site:

Python 1.6 also gets a "unicode" built-in function, to which you can specify the encoding:

> >>> unicode('hello')
> u'hello'
> >>> unicode('hello', 'ascii')
> u'hello'
> >>> unicode('hello', 'iso-8859-1')
> u'hello'

All three of these return the same thing, since the characters in 'Hello' are common to all three encodings.

Now let's encode something with a European accent, which is outside of ASCII. What you see at a console may depend on your operating system locale; Windows lets me type in ISO-Latin-1.

> >>> a = unicode('André','latin-1')
> >>> a
> u'Andr\202'

If you can't type an acute letter e, you can enter the string 'Andr\202', which is unambiguous.

Unicode supports all the common operations such as iteration and splitting. We won't run over them here.
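
The quoted session predates Python 3, where the `unicode` built-in no longer exists: `str` is already Unicode, and raw bytes are turned into text with `bytes.decode`. A rough Python 3 counterpart, offered as a sketch rather than as part of the tutorial:

```python
raw = b"hello"

# Every byte is below 128, so ASCII, ISO-8859-1 and UTF-8 all agree.
assert raw.decode("ascii") == raw.decode("iso-8859-1") == raw.decode("utf-8") == "hello"

# The byte 0xE9 is 'é' in ISO-8859-1 (Latin-1) but is not valid ASCII.
accented = b"Andr\xe9".decode("latin-1")
print(ascii(accented))  # 'Andr\xe9'

try:
    b"Andr\xe9".decode("ascii")
    ascii_decode_failed = False
except UnicodeDecodeError:
    ascii_decode_failed = True
```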

Pete Karl II answered Oct 15 '22 00:10