I use a third-party tool that outputs a file in Unicode format, but I would prefer the file to be in ASCII. The tool has no setting to change the output format. What is the best way to convert the entire file using Python?


How do I convert a file's format from Unicode to ASCII using Python?

2 Answers

You can convert the file easily enough using just the unicode function, but you'll run into problems with Unicode characters that have no direct ASCII equivalent.

This blog recommends the unicodedata module, which approximates characters that have no directly corresponding ASCII value. For example:

>>> title = u"Klüft skräms inför på fédéral électoral große"

is typically converted, by simply dropping the non-ASCII characters, to

Klft skrms infr p fdral lectoral groe

which is pretty wrong. However, using the unicodedata module, the result can be much closer to the original text:

>>> import unicodedata
>>> unicodedata.normalize('NFKD', title).encode('ascii','ignore')
'Kluft skrams infor pa federal electoral groe'
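
To apply the same idea to a whole file, as the question asks, something like the following might work. It is only a sketch under assumptions: it targets Python 3 (where the result of encode is bytes, so it is decoded back to text before writing), the names to_ascii and convert_file are made up for illustration, and the default source encoding of UTF-16 is a guess at what the tool means by "Unicode format". Pass the real encoding if it differs.

```python
import io
import unicodedata


def to_ascii(text):
    """Decompose accented characters (NFKD), then drop whatever is still non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def convert_file(src_path, dst_path, src_encoding="utf-16"):
    """Read src_path in src_encoding (an assumption) and write an ASCII copy to dst_path."""
    with io.open(src_path, "r", encoding=src_encoding) as src, \
            io.open(dst_path, "w", encoding="ascii") as dst:
        for line in src:
            # Characters with no decomposition (such as 'ß') are silently lost here.
            dst.write(to_ascii(line))
```

A call would look like convert_file("tool_output.txt", "tool_output_ascii.txt"), where both file names are placeholders. Processing line by line keeps memory use flat for large files; the conversion is lossy, so keep the original file.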

answered Oct 15 '22 01:10

ConroyP

I think this is a deeper issue than you realize. Simply changing the file from Unicode to ASCII is easy; getting every Unicode character to translate into a reasonable ASCII counterpart is another matter, because many letters are not available in both encodings.
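
To make the problem concrete, here is a small sketch (assuming Python 3; the helper name encode_ascii and the sample string are made up for illustration) of what the standard error handlers of str.encode do with characters that have no ASCII counterpart. None of them finds a "reasonable" equivalent; they only differ in how the loss shows up.

```python
text = u"Andr\u00e9 gro\u00dfe"  # 'André große'


def encode_ascii(value, errors):
    """Encode to ASCII with the given error handler; return text for easy display."""
    return value.encode("ascii", errors).decode("ascii")


# The default handler, 'strict', refuses to encode at all.
try:
    text.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

results = {
    errors: encode_ascii(text, errors)
    for errors in ("ignore", "replace", "backslashreplace", "xmlcharrefreplace")
}

print(strict_failed)                 # True
print(results["ignore"])             # Andr groe
print(results["replace"])            # Andr? gro?e
print(results["backslashreplace"])   # Andr\xe9 gro\xdfe
print(results["xmlcharrefreplace"])  # Andr&#233; gro&#223;e
```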

This Python Unicode tutorial may give you a better idea of what happens to Unicode strings that are translated to ASCII: http://www.reportlab.com/i18n/python_unicode_tutorial.html

Here's a useful quote from the site:

Python 1.6 also gets a "unicode" built-in function, to which you can specify the encoding:

>>> unicode('hello')
u'hello'
>>> unicode('hello', 'ascii')
u'hello'
>>> unicode('hello', 'iso-8859-1')
u'hello'

All three of these return the same thing, since the characters in 'Hello' are common to all three encodings.

Now let's encode something with a European accent, which is outside of ASCII. What you see at a console may depend on your operating system locale; Windows lets me type in ISO-Latin-1.

>>> a = unicode('André','latin-1')
>>> a
u'Andr\202'

If you can't type an acute letter e, you can enter the string 'Andr\202', which is unambiguous.

Unicode supports all the common operations such as iteration and splitting. We won't run over them here.
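
The quoted session is from very old Python. The unicode built-in no longer exists in Python 3, where the rough equivalent is calling decode on a bytes object. A minimal sketch, assuming Python 3 and assuming the input bytes really are Latin-1 (the variable names are illustrative):

```python
raw = b"Andr\xe9"  # the bytes of 'André' in ISO-8859-1 (Latin-1)

# Decoding with the right codec gives a normal (Unicode) str.
text = raw.decode("latin-1")
print(text == u"Andr\u00e9")  # True

# Decoding the same bytes as ASCII fails, because 0xE9 is outside the ASCII range.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
print(ascii_ok)  # False
```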

answered Oct 15 '22 00:10

Pete Karl II


Tags:

python

file

encoding

unicode

ascii

Ray
