How to use unidecode in python (3.3)

Tags: python, encoding, unicode

I'm trying to remove all non-ASCII characters from a text document. I found a package that should do just that: https://pypi.python.org/pypi/Unidecode

It should accept a string and convert all non-ASCII characters to the closest ASCII character available. I used the same module in Perl easily enough by just calling while (<input>) { $_ = unidecode($_); }. This one is a direct port of the Perl module, and the documentation indicates that it should work the same way.

I'm sure this is something simple; I just don't understand enough about character and file encoding to know what the problem is. My origfile is encoded in UTF-8 (converted from UCS-2LE). The problem may have more to do with my lack of encoding knowledge and mishandling of strings than with the module; hopefully someone can explain why. I've tried everything I know, short of randomly inserting code, and searched for the errors I'm getting, with no luck so far.

Here's my Python:

from unidecode import unidecode

def toascii():
    origfile = open(r'C:\log.convert', 'rb')
    convertfile = open(r'C:\log.toascii', 'wb')

    for line in origfile:
        line = unidecode(line)
        convertfile.write(line)

    origfile.close()
    convertfile.close()

toascii();

If I don't open the original file in byte mode (origfile = open('file.txt', 'r')), I get the error UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 1563: character maps to <undefined> from the for line in origfile: line.

If I do open it in byte mode ('rb'), I get TypeError: ord() expected string of length 1, but int found from the line = unidecode(line) line.

If I convert line to a string with line = unidecode(str(line)), it will write to the file, but not correctly: \r\n'b'\xef\xbb\xbf[ 2013.10.05 16:18:01 ] User_Name > .\xe2\x95\x90\xe2\x95\x90\xe2\x95\x90\xe2\x95\x90\ It's writing out the \n, \r, etc. and the Unicode characters as literal escape sequences instead of converting them to anything.

If I convert the line to a string as above and open convertfile in byte mode ('wb'), it gives the error TypeError: 'str' does not support the buffer interface.

If I open convertfile in byte mode ('wb') and call unidecode(line) without converting line to a string, I get the TypeError: ord() expected string of length 1, but int found error again.

240

asked Nov 04 '13 16:11

BeanBagKing

1 Answer

The unidecode module accepts Unicode string values and returns a Unicode string in Python 3. You are giving it binary data instead. Either decode the bytes to Unicode yourself or open the input file in text mode; then either encode the result to ASCII before writing it, or open the output file in text mode.

Quoting from the module documentation:

The module exports a single function that takes an Unicode object (Python 2.x) or string (Python 3.x) and returns a string (that can be encoded to ASCII bytes in Python 3.x)

Emphasis mine.
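The errors in the question follow from handing bytes to code that expects text. The sketch below uses only the standard library and a made-up sample line (the byte string is an assumption, not taken from the actual log file). It shows that iterating over bytes yields integers, which is presumably what trips the ord() call inside unidecode, and that str() on a bytes object returns its repr rather than decoding it.

```python
# A line read from a file opened in 'rb' mode is a bytes object.
# (Hypothetical sample: a box-drawing character, then "café" and CRLF.)
line = b"\xe2\x95\x90 caf\xc3\xa9\r\n"

# Indexing or iterating over bytes yields ints, not 1-character strings,
# so code that calls ord() on each item raises TypeError.
first_item = line[0]
print(type(first_item))

# str() on bytes without an encoding returns the repr: the b'...' wrapper
# and the backslash escapes become literal characters in the result.
as_repr = str(line)
print(as_repr)

# Decoding is what turns the bytes into real text.
as_text = line.decode("utf-8")
print(repr(as_text))
```

This is why the str(line) attempt wrote b'...' and literal \r\n sequences to the output file: the text being converted was the printable representation of the bytes, not the decoded line.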

This should work:

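from unidecode import unidecode

def toascii():
    # Text mode on both ends: decode the input as UTF-8, encode the output as ASCII.
    with open(r'C:\log.convert', 'r', encoding='utf8') as origfile, open(r'C:\log.toascii', 'w', encoding='ascii') as convertfile:
        for line in origfile:
            # unidecode() takes a str and returns a str that can be encoded to ASCII.
            line = unidecode(line)
            convertfile.write(line)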

This opens the input file in text mode (using UTF-8 encoding, which judging by your sample line is correct) and writes the output in text mode (encoding to ASCII).
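One detail worth checking: the sample output in the question begins with \xef\xbb\xbf, which is the UTF-8 byte order mark. The sketch below is an illustration using a temporary file and a made-up first line, not the asker's data. It shows that the 'utf8' codec keeps the mark as a U+FEFF character at the start of the first line, while the standard 'utf-8-sig' codec strips it on reading. It does not check what unidecode does with U+FEFF.

```python
import os
import tempfile

# Hypothetical sample data: a UTF-8 BOM followed by one log line.
data = b"\xef\xbb\xbf[ 2013.10.05 16:18:01 ] User_Name\r\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "log.convert")
    with open(path, "wb") as f:
        f.write(data)

    # Plain UTF-8 decoding keeps the BOM as the first character.
    with open(path, "r", encoding="utf-8") as f:
        with_bom = f.readline()

    # 'utf-8-sig' skips a leading BOM if one is present.
    with open(path, "r", encoding="utf-8-sig") as f:
        without_bom = f.readline()

print(repr(with_bom[0]))
print(repr(without_bom[0]))
```

If the first line of the converted file looks wrong or the write fails on its first character, swapping encoding='utf8' for encoding='utf-8-sig' on the input file is a cheap thing to try.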

You do need to explicitly specify the encoding of the file you are opening; if you omit it, the encoding of the current system locale is used (the result of a locale.getpreferredencoding(False) call), which usually won't be the correct codec if your code needs to be portable.
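The UnicodeDecodeError in the question is consistent with this. As a sketch, assuming the asker's locale encoding was cp1252 (a guess based on the Windows paths and the 'charmap' wording, not something the question states), decoding the UTF-8 bytes of one of the box-drawing characters from the sample line fails on byte 0x90:

```python
import locale

# The locale-derived encoding that is used when open() gets no encoding=.
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)

# U+2550, a box-drawing character from the sample line, encoded as UTF-8.
raw = "\u2550".encode("utf-8")
print(raw)

# cp1252 has no character assigned to byte 0x90, so decoding fails there.
error = None
try:
    raw.decode("cp1252")
except UnicodeDecodeError as exc:
    error = exc
    print(exc)
```

Passing encoding='utf8' to open() takes the locale out of the picture, so the same script behaves the same way on machines with different defaults.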

145

answered Sep 26 '22 01:09

Martijn Pieters
