Python encoding problems

I've read a lot about Python encoding, maybe not enough, but I've been working on this for 2 days and I'm still having trouble. I'll try to be as clear as I can. The main goal is to remove all accents and characters such as #, !, %, &...

I query the Twitter Search API with this call:

query = urllib2.urlopen(settings.SEARCH_URL + '?%s' % params)

Then, I call a method (avaliar_pesquisa()) to evaluate the results I've got, based on the tags (or terms) of the input:

dados = avaliar_pesquisa(simplejson.loads(query.read()), str(tags))

In avaliar_pesquisa(), the following happens:

def avaliar_pesquisa(dados, tags):
    resultados = []
    # Iterate over the results
    for i in dados['results']:
        resultados.append({'texto'          : i['text'],
                           'imagem'         : i['profile_image_url'],
                           'classificacao'  : avaliar_texto(i['text'], tags),
                           'timestamp'      : i['created_at'],
                         })

Note the call to avaliar_texto(), which evaluates the tweet text. The problem is in the following lines:

def avaliar_texto(texto, tags):
    # Remove accents
    from unicodedata import normalize
    def strip_accents(txt):
        return normalize('NFKD', txt.decode('utf-8'))

    # Split
    texto_split = strip_accents(texto)
    texto_split = texto.lower().split()

    # Remove non-alpha characters
    import re
    pattern = re.compile('[\W_]+')
    texto_aux = []
    for i in texto_split:
        texto_aux.append(pattern.sub('', i))
    texto_split = texto_aux

The split doesn't really matter here. If I print the type of the variable texto inside this last method, I may get either str or unicode. If the text contains any accent, it arrives as unicode. So, when I run the application, which receives at most 100 tweets per response, I get this error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 17: ordinal not in range(128)

For the following text:

Text: Agora o problema é com o speedy. type 'unicode'

Any ideas?

Leonardo Pessoa asked Dec 06 '22 19:12


2 Answers

See this page.

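The decode() method is meant to be applied to a str object, not a unicode object. Given a unicode string as input, Python 2 first tries to encode it to a str using the ascii codec so that it can then decode the result as utf-8. That implicit encode step is what fails, which is why the error is a UnicodeEncodeError.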
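The implicit step can be made visible by spelling it out. The following is only a minimal sketch, written for Python 3 (where str is always Unicode text and the implicit conversion no longer exists), so the explicit encode("ascii") call is an assumption standing in for what Python 2 did behind the scenes. The variable names are illustrative.

```python
# The tweet text from the question, with the accented character escaped.
text = "Agora o problema \u00e9 com o speedy."

try:
    # Python 2 performed this ASCII encode implicitly before decoding.
    text.encode("ascii")
except UnicodeEncodeError as exc:
    # exc.start is the index of the first character that could not be encoded.
    position = exc.start
    failing_char = text[position]
    print(position, repr(failing_char))
```

The reported position matches the "position 17" in the traceback from the question.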

Try return normalize('NFKD', unicode(txt)).
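Since the input may arrive either as raw bytes or as already-decoded text, one option is to decode only when needed. This is a rough sketch in Python 3 terms, where bytes and str play the roles of Python 2's str and unicode. The helper names to_text and normalize_text are hypothetical, and the UTF-8 default is an assumption about the source data.

```python
import unicodedata

def to_text(value, encoding="utf-8"):
    """Return value as text, decoding only if it is a byte string."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

def normalize_text(value):
    """Apply NFKD normalization to either bytes or text input."""
    return unicodedata.normalize("NFKD", to_text(value))

already_text = "Agora o problema \u00e9 com o speedy."
raw_bytes = already_text.encode("utf-8")

# Both input types end up as the same normalized text.
assert normalize_text(raw_bytes) == normalize_text(already_text)
```

Note that normalization alone only decomposes accented characters into a base letter plus a combining mark; it does not remove the accent.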

Russell Borogove answered Dec 31 '22 13:12


This is what I used in my code to discard accents, etc.

text = unicodedata.normalize('NFD', text).encode('ascii','ignore')
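For illustration, here is how that one-liner might fit into the cleanup described in the question (strip accents, lowercase, split, drop non-alphanumeric characters). It is a sketch for Python 3, not the original application's code: strip_accents mirrors the name used in the question, while clean_tokens and NON_WORD are hypothetical names, and the sample text adds # and ! to the tweet from the question.

```python
import re
import unicodedata

# Same pattern as in the question: runs of non-word characters or underscores.
NON_WORD = re.compile(r"[\W_]+")

def strip_accents(text):
    """Decompose accented characters, then drop everything non-ASCII."""
    decomposed = unicodedata.normalize("NFD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

def clean_tokens(text):
    """Return lowercase words with accents and punctuation removed."""
    words = strip_accents(text).lower().split()
    cleaned = (NON_WORD.sub("", word) for word in words)
    # Skip tokens that were made up entirely of removed characters.
    return [word for word in cleaned if word]

print(clean_tokens("Agora o problema \u00e9 com o #speedy!"))
# ['agora', 'o', 'problema', 'e', 'com', 'o', 'speedy']
```

One caveat: the 'ignore' handler silently drops any character that has no ASCII decomposition (for example 'ø'), so this suits accent stripping but is lossy for other scripts.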
Ethan Furman answered Dec 31 '22 15:12