
Python's string.maketrans works at home but fails on Google App Engine

I have this code in Google AppEngine (Python SDK):

import logging
from string import maketrans

intab = u"ÀÁÂÃÄÅàáâãäåÒÓÔÕÖØòóôõöøÈÉÊËèéêëÇçÌÍÎÏìíîïÙÚÛÜùúûüÿÑñ".encode('latin1')
outtab = u"aaaaaaaaaaaaooooooooooooeeeeeeeecciiiiiiiiuuuuuuuuynn".encode('latin1')
logging.info(len(intab))   # logged as 106 on GAE
logging.info(len(outtab))  # logged as 53 on GAE
trantab = maketrans(intab, outtab)

When I run the code in the interactive console I have no problem, but when I try it in GAE I get the following error:

raise ValueError, "maketrans arguments must have same length"
ValueError: maketrans arguments must have same length
INFO 2009-12-03 20:04:02,904 dev_appserver.py:3038] "POST /backendsavenew HTTP/1.1" 500 -
INFO 2009-12-03 20:08:37,649 admin.py:112] 106
INFO 2009-12-03 20:08:37,651 admin.py:113] 53
ERROR 2009-12-03 20:08:37,653 init.py:388] maketrans arguments must have same length

I can't figure out why intab is doubled in size (106 instead of 53). The Python file with the code is saved as UTF-8.

Thanks in advance for any help.

Juan E. asked Dec 03 '09 20:12


2 Answers

string.maketrans and string.translate do not work for Unicode strings; they operate on byte strings, with one table entry per byte. Your non-ASCII characters are ending up as multi-byte sequences: in utf-8, å takes two bytes where ASCII a takes one. string.maketrans therefore sees 106 bytes in intab against 53 in outtab, which matches the lengths in your log.
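The length mismatch can be reproduced directly. The sketch below uses a shortened 15-character table and is written to run on Python 3 (the same calls work on u"" literals in Python 2). The last step, decoding the utf-8 bytes as Latin-1, is only an assumption about what happens when the source file's encoding is misread; the log does not confirm it.

```python
# -*- coding: utf-8 -*-
# A shortened version of the table from the question (15 characters).
intab = u"ÀàÒòÈèÇçÌìÙùÿÑñ"

as_latin1 = intab.encode("latin1")  # one byte per character
as_utf8 = intab.encode("utf-8")     # two bytes per character in this range

print(len(intab), len(as_latin1), len(as_utf8))

# Assumption: if the utf-8 bytes of the source file are read as Latin-1,
# every character turns into two, which doubles the length of the literal.
misread = as_utf8.decode("latin1")
print(len(misread))
```

If this is what happens on GAE, encoding the misread literal back to Latin-1 yields the original utf-8 bytes, so the doubled length survives the .encode('latin1') call.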

There is a Unicode translate, but for your use case (convert Unicode to ASCII because some part of your system cannot deal with Unicode) you should use Unidecode (http://pypi.python.org/pypi/Unidecode). Unidecode is very smart about transliterating Unicode characters to sensible ASCII, and it covers many more characters than your example does.

You should save your Python code as utf-8, but make sure you add the encoding declaration so that Python doesn't have to guess which encoding the file uses. This line should be the first or second line of your Python files:

# -*- coding: utf-8 -*-

There are many advantages to processing text as Unicode instead of binary strings. This is the Unicode way to do what you are trying to do:

intab =  u"ÀÁÂÃÄÅàáâãäåÒÓÔÕÖØòóôõöøÈÉÊËèéêëÇçÌÍÎÏìíîïÙÚÛÜùúûüÿÑñ"
outtab = u"aaaaaaaaaaaaooooooooooooeeeeeeeecciiiiiiiiuuuuuuuuynn"
trantab = dict((ord(a), b) for a, b in zip(intab, outtab))
translated = intab.translate(trantab)
translated == outtab # True
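For readers on Python 3, where every str is Unicode, the same kind of table can be built with the built-in str.maketrans. This is only a sketch on a shortened table, and the helper name strip_accents is made up for illustration.

```python
# Shortened table; extend both strings in step so they stay the same length.
intab = "ÀàÒòÈèÇçÌìÙùÿÑñ"
outtab = "aaooeecciiuuynn"

# str.maketrans maps each code point in intab to the character at the
# same position in outtab; it raises ValueError if the lengths differ.
trantab = str.maketrans(intab, outtab)


def strip_accents(text):
    """Hypothetical helper: replace the accented letters listed in intab."""
    return text.translate(trantab)


print(strip_accents("Çà et là, señor"))  # ca et la, senor
```

Characters that are not in the table pass through unchanged, so the table has to list every letter you want replaced.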

See also Where is Python's "best ASCII for this Unicode" database?

See also How do I get str.translate to work with Unicode strings?

joeforker answered Nov 19 '22 21:11


Maybe you could use iso-8859-1 encoding for your file instead of utf-8, so that every character in intab is stored as a single byte:

# -*- coding: iso-8859-1 -*-
from string import maketrans 
import logging

intab =  "ÀÁÂÃÄÅàáâãäåÒÓÔÕÖØòóôõöøÈÉÊËèéêëÇçÌÍÎÏìíîïÙÚÛÜùúûüÿÑñ"
outtab = "aaaaaaaaaaaaooooooooooooeeeeeeeecciiiiiiiiuuuuuuuuynn"
logging.info(len(intab))
logging.info(len(outtab))
trantab = maketrans(intab, outtab)

Remember to select iso-8859-1 in your text editor when saving this Python source file; the coding line must match the encoding the file is actually stored in.
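If a byte-level table is what you want, the dependence on the editor's encoding can also be avoided by encoding explicitly. The sketch below assumes Python 3, where bytes.maketrans plays the role of string.maketrans and source files are read as utf-8 by default; the table is shortened for readability.

```python
# Encode the accented letters explicitly: Latin-1 gives one byte each.
intab = "ÀàÒòÈèÇçÌìÙùÿÑñ".encode("latin1")
outtab = b"aaooeecciiuuynn"

# Both arguments must have the same length in bytes.
assert len(intab) == len(outtab)

trantab = bytes.maketrans(intab, outtab)

# The data to translate must use the same single-byte encoding.
data = "Çà et là".encode("latin1")
print(data.translate(trantab))
```

This only works for characters that Latin-1 can represent; anything outside that range needs the Unicode translate approach instead.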

Kamil Szot answered Nov 19 '22 20:11