Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to use unidecode in python (3.3)

I'm trying to remove all non-ascii characters from a text document. I found a package that should do just that, https://pypi.python.org/pypi/Unidecode

It should accept a string and convert all non-ascii characters to the closest ascii character available. I used this same module in perl easily enough by just calling while (<input>) { $_ = unidecode($_); } and this one is a direct port of the perl module, the documentation indicates that it should work the same.

I'm sure this is something simple, I just don't understand enough about character and file encoding to know what the problem is. My origfile is encoded in UTF-8 (converted from UCS-2LE). The problem may have more to do with my lack of encoding knowledge and handling strings wrong than the module, hopefully someone can explain why though. I've tried everything I know without just randomly inserting code and search the errors I'm getting with no luck so far.

Here's my python

from unidecode import unidecode

def toascii():
    origfile = open(r'C:\log.convert', 'rb')
    convertfile = open(r'C:\log.toascii', 'wb')

    for line in origfile:
        line = unidecode(line)
        convertfile.write(line)

    origfile.close()
    convertfile.close()

toascii();

If I don't open the original file in byte mode (origfile = open('file.txt','r') then I get an error UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 1563: character maps to <undefined> from the for line in origfile: line.

If I do open it in byte mode 'rb' I get TypeError: ord() expected string length 1, but int found from the line = unidecode(line) line.

if I declare line as a string line = unidecode(str(line)) then it will write to the file, but... not correctly. \r\n'b'\xef\xbb\xbf[ 2013.10.05 16:18:01 ] User_Name > .\xe2\x95\x90\xe2\x95\x90\xe2\x95\x90\xe2\x95\x90\ It's writing out the \n, \r, etc and unicode characters instead of converting them to anything.

If I convert the line to string as above, and open the convertfile in byte mode 'wb' it gives the error TypeError: 'str' does not support the buffer interface

If I open it in byte mode without declaring it a string 'wb' and unidecode(line) then I get the TypeError: ord() expected string length 1, but int found error again.

like image 240
BeanBagKing Avatar asked Nov 04 '13 16:11

BeanBagKing


People also ask

How do you use Unidecode in Python?

from unidecode import unidecode def toascii(): origfile = open(r'C:\log. convert', 'rb') convertfile = open(r'C:\log. toascii', 'wb') for line in origfile: line = unidecode(line) convertfile. write(line) origfile.

What is Unicode in Python 3?

In Python3, the default string is called Unicode string (u string), you can understand them as human-readable characters. As explained above, you can encode them to the byte string (b string), and the byte string can be decoded back to the Unicode string.

How do you get the UTF-8 character code in Python?

UTF-8 is a variable-length encoding, so I'll assume you really meant "Unicode code point". Use chr() to convert the character code to a character, decode it, and use ord() to get the code point. In Python 2, chr only supports ASCII, so only numbers in the [0.. 255] range.


1 Answers

The unidecode module accepts unicode string values and returns a unicode string in Python 3. You are giving it binary data instead. Decode to unicode or open the input text file in textmode, and encode the result to ASCII before writing it to a file, or open the output text file in text mode.

Quoting from the module documentation:

The module exports a single function that takes an Unicode object (Python 2.x) or string (Python 3.x) and returns a string (that can be encoded to ASCII bytes in Python 3.x)

Emphasis mine.

This should work:

def toascii():
    with open(r'C:\log.convert', 'r', encoding='utf8') as origfile, open(r'C:\log.toascii', 'w', encoding='ascii') as convertfile:
        for line in origfile:
            line = unidecode(line)
            convertfile.write(line)

This opens the inputfile in text modus (using UTF8 encoding, which judging by your sample line is correct) and writes in text modus (encoding to ASCII).

You do need to explicitly specify the encoding of the file you are opening; if you omit the encoding the current system locale is used (the result of a locale.getpreferredencoding(False) call), which usually won't be the correct codec if your code needs to be portable.

like image 145
Martijn Pieters Avatar answered Sep 26 '22 01:09

Martijn Pieters