I use a 3rd party tool that outputs a file in Unicode format. However, I prefer it to be in ASCII. The tool does not have settings to change the file format.
What is the best way to convert the entire file format using Python?
You CAN'T convert from Unicode to ASCII. Almost every character in Unicode cannot be expressed in ASCII, and those that can be expressed have exactly the same codepoints in ASCII as in UTF-8, which is probably what you have.
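A quick check of both halves of that claim, as a Python 3 sketch (the sample strings are made up for illustration, not taken from the question):

```python
# ASCII characters encode to the same bytes in ASCII and in UTF-8.
text = "plain text"
same_bytes = text.encode("ascii") == text.encode("utf-8")

# A non-ASCII character has no ASCII encoding, so strict encoding fails.
try:
    "Andr\u00e9".encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

print(same_bytes, strict_failed)
```

So a file that contains only ASCII characters and is stored as UTF-8 (without a byte order mark) is already a valid ASCII file.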
Method #2: Using join() + format() + ord(). In this method, format() handles the substitution into the Unicode-formatted string, and ord() handles the conversion of each character to its code point.
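One plausible reading of that method, sketched in Python 3. The helper name `to_unicode_escapes` and the four-digit `\uXXXX` output format are assumptions for illustration; the original code for this method is not shown here.

```python
def to_unicode_escapes(text):
    """Render each character as a \\uXXXX escape (hypothetical helper)."""
    # ord() gives the code point; format() pads it to at least four hex digits.
    # Code points above U+FFFF produce more than four digits, which this
    # sketch does not treat specially.
    return "".join("\\u{:04X}".format(ord(ch)) for ch in text)


escaped = to_unicode_escapes("Ab\u00e9")
print(escaped)
```

The result contains only ASCII characters, at the cost of no longer being readable text.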
chr() is a built-in Python function that converts an integer code into its corresponding character. It takes a single integer argument and returns the character with that code. For values 0 to 127 this is the ASCII character; in Python 3, larger values give other Unicode characters.
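A short Python 3 illustration of chr() and its inverse ord(), including a simple way to test whether a character falls inside the ASCII range (the variable names are made up for the example):

```python
# chr() maps an integer code to a character; ord() goes the other way.
letter = chr(65)
code = ord("A")

# In Python 3, chr() also accepts codes beyond the ASCII range.
e_acute = chr(233)

# A character is ASCII exactly when its code is below 128.
is_ascii = [ord(ch) < 128 for ch in (letter, e_acute)]
print(letter, code, e_acute, is_ascii)
```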
Python's string type uses the Unicode Standard for representing characters, which lets Python programs work with all these different possible characters.
You can convert the file easily enough just using the unicode function, but you'll run into problems with Unicode characters without a straight ASCII equivalent.
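The unicode function exists only in Python 2. In Python 3 the same whole-file conversion might look like the sketch below. The helper name `convert_to_ascii`, the file names, the UTF-8 source encoding, and the choice of `errors="replace"` are all assumptions; a tool that writes "Unicode" on Windows may actually be producing UTF-16, in which case `src_encoding` would need to change.

```python
import os
import tempfile


def convert_to_ascii(src_path, dst_path, src_encoding="utf-8", errors="replace"):
    """Re-encode a text file as ASCII (hypothetical helper)."""
    with open(src_path, "r", encoding=src_encoding) as src, \
         open(dst_path, "w", encoding="ascii", errors=errors) as dst:
        for line in src:
            # With errors="replace", characters outside ASCII become "?".
            dst.write(line)


# Small demonstration with throwaway files.
workdir = tempfile.mkdtemp()
src_file = os.path.join(workdir, "in.txt")
dst_file = os.path.join(workdir, "out.txt")
with open(src_file, "w", encoding="utf-8") as f:
    f.write("Andr\u00e9\n")

convert_to_ascii(src_file, dst_file)
with open(dst_file, "rb") as f:
    converted = f.read()
print(converted)
```

The replacement is lossy: every character without an ASCII equivalent ends up as the same "?".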
This blog recommends the unicodedata module, which seems to take care of roughly converting characters without direct corresponding ASCII values, e.g.
>>> title = u"Klüft skräms inför på fédéral électoral große"
is typically converted to
Klft skrms infr p fdral lectoral groe
which is pretty wrong. However, using the unicodedata module, the result can be much closer to the original text:
>>> import unicodedata
>>> unicodedata.normalize('NFKD', title).encode('ascii','ignore')
'Kluft skrams infor pa federal electoral groe'
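The session above is Python 2. A Python 3 sketch of the same idea follows; the only change is decoding the resulting bytes back to a str, and the variable names are my own.

```python
import unicodedata

title = "Klüft skräms inför på fédéral électoral große"

# NFKD splits accented letters into a base letter plus combining marks.
decomposed = unicodedata.normalize("NFKD", title)

# Encoding with "ignore" then drops the combining marks (and anything else
# that is not ASCII); decode() turns the bytes back into a str.
ascii_title = decomposed.encode("ascii", "ignore").decode("ascii")
print(ascii_title)
```

Note that ß has no decomposition, so it is dropped entirely rather than turned into "ss".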
I think this is a deeper issue than you realize. Simply changing the file from Unicode into ASCII is easy; getting all of the Unicode characters to translate into reasonable ASCII counterparts is another matter, because most of them have no ASCII equivalent.
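To make the trade-off concrete, here is a Python 3 sketch comparing the standard error handlers that str.encode accepts when a character has no ASCII counterpart (the sample name is only an illustration):

```python
name = "Andr\u00e9"

# Each handler decides differently what to do with the unencodable character.
handlers = ("ignore", "replace", "xmlcharrefreplace", "backslashreplace")
results = {handler: name.encode("ascii", errors=handler) for handler in handlers}

for handler, encoded in results.items():
    print(handler, encoded)
```

None of these produces a "reasonable counterpart" such as a plain e; they only control how the loss is represented.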
This Python Unicode tutorial may give you a better idea of what happens to Unicode strings that are translated to ASCII: http://www.reportlab.com/i18n/python_unicode_tutorial.html
Here's a useful quote from the site:
Python 1.6 also gets a "unicode" built-in function, to which you can specify the encoding:
> >>> unicode('hello')
> u'hello'
> >>> unicode('hello', 'ascii')
> u'hello'
> >>> unicode('hello', 'iso-8859-1')
> u'hello'
> >>>
All three of these return the same thing, since the characters in 'hello' are common to all three encodings.
Now let's encode something with a European accent, which is outside of ASCII. What you see at a console may depend on your operating system locale; Windows lets me type in ISO-Latin-1.
> >>> a = unicode('André','latin-1')
> >>> a
> u'Andr\202'
If you can't type an acute letter e, you can enter the string 'Andr\202', which is unambiguous.
Unicode supports all the common operations such as iteration and splitting. We won't run over them here.
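The quoted session predates Python 3, where the unicode built-in is gone and bytes.decode plays its role. A rough Python 3 equivalent, as a sketch: the byte 0xE9 used here is é in ISO-8859-1, the surname is made up, and the `\202` in the quote is, as far as I can tell, the é byte of a DOS console code page (0x82 in CP437/CP850) rather than of Latin-1.

```python
raw = b"hello"

# The same bytes decode identically under all three encodings,
# because every character in "hello" is plain ASCII.
decoded = [raw.decode(enc) for enc in ("utf-8", "ascii", "iso-8859-1")]

# 0xE9 is the ISO-8859-1 byte for "e with acute accent".
a = b"Andr\xe9".decode("latin-1")

# Iteration and splitting work on str as usual.
letters = list(a)
parts = "Andr\u00e9 Dupont".split()  # "Dupont" is a made-up surname
print(decoded, a, letters, parts)
```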