I've read a lot about Python encoding, maybe not enough, and after two days of working on this I'm still having trouble. I'll try to be as clear as I can. The main goal is to remove all accents and characters such as #, !, %, &...
I run a search against the Twitter Search API with this call:
query = urllib2.urlopen(settings.SEARCH_URL + '?%s' % params)
Then I call a function, avaliar_pesquisa(), to evaluate the results I got, based on the tags (or terms) of the input:
dados = avaliar_pesquisa(simplejson.loads(query.read()), str(tags))
In avaliar_pesquisa(), the following happens:
def avaliar_pesquisa(dados, tags):
    resultados = []
    # Iterate over the results
    for i in dados['results']:
        resultados.append({'texto'          : i['text'],
                           'imagem'         : i['profile_image_url'],
                           'classificacao'  : avaliar_texto(i['text'], tags),
                           'timestamp'      : i['created_at'],
                         })
Note the call to avaliar_texto(), which evaluates the tweet text. The problem is exactly in the following lines:
def avaliar_texto(texto, tags):
    # Remove accents
    from unicodedata import normalize
    def strip_accents(txt):
        return normalize('NFKD', txt.decode('utf-8'))
    # Split
    texto_split = strip_accents(texto)
    texto_split = texto.lower().split()
    # Remove non-alpha characters
    import re
    pattern = re.compile('[\W_]+')
    texto_aux = []
    for i in texto_split:
        texto_aux.append(pattern.sub('', i))
    texto_split = texto_aux
The split doesn't really matter here.
The thing is, if I print the type of the variable texto inside this last function, I may get either str or unicode. If the text contains any accented character, it arrives as unicode.
So, when I run the application, which receives at most 100 tweets per response, I get this error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 17: ordinal not in range(128)
For the following text:
Text: Agora o problema é com o speedy. type 'unicode'
Any ideas?
See this page.
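The decode() method is meant to be applied to a str object, not a unicode object. Given a unicode string as input, Python 2 first tries to encode it to a str using the ascii codec and only then decodes the result as utf-8. That implicit encoding step is what fails, which is why the error is a UnicodeEncodeError.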
Try return normalize('NFKD', unicode(txt)).
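A minimal sketch of the same idea, assuming the tweet text can arrive either as UTF-8 encoded bytes or as already-decoded text. The names to_text and decompose are hypothetical, not part of the original code. decode() is only called on byte strings, so the implicit ASCII encoding step never happens:

```python
from unicodedata import normalize


def to_text(value, encoding='utf-8'):
    """Return value as text, decoding only byte strings (hypothetical helper)."""
    # On Python 2, `bytes` is an alias for `str`; on Python 3 it is the bytes type.
    # The utf-8 default is an assumption about how the input was encoded.
    if isinstance(value, bytes):
        return value.decode(encoding)
    # Already text (unicode on Python 2, str on Python 3): leave it untouched.
    return value


def decompose(txt):
    # NFKD splits an accented character into its base character
    # followed by a combining mark; it does not remove the mark.
    return normalize('NFKD', to_text(txt))
```

Note that NFKD on its own only separates the accent from the letter. The combining marks still have to be dropped in a later step.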
This is what I used in my code to discard accents, etc.
text = unicodedata.normalize('NFD', text).encode('ascii','ignore')
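Putting that line together with the split and the non-alphanumeric pattern from the question gives something like the sketch below. It is only an illustration: clean_words is a hypothetical name, and the input is assumed to be either text or UTF-8 encoded bytes.

```python
import re
import unicodedata

# Same pattern as in the question: runs of non-word characters or underscores.
NON_ALNUM = re.compile(r'[\W_]+')


def clean_words(texto):
    """Lowercase, strip accents and punctuation, return a list of words (hypothetical helper)."""
    # Decode only byte strings; assumes UTF-8 input.
    if isinstance(texto, bytes):
        texto = texto.decode('utf-8')
    # NFD separates accents from letters; encoding to ASCII with 'ignore'
    # then drops the combining marks and any other non-ASCII character.
    ascii_text = unicodedata.normalize('NFD', texto).encode('ascii', 'ignore').decode('ascii')
    words = [NON_ALNUM.sub('', w) for w in ascii_text.lower().split()]
    # Tokens made only of symbols (for example '&') become empty strings; skip them.
    return [w for w in words if w]
```

One design caveat: 'ignore' silently discards every character with no ASCII decomposition, so text in non-Latin scripts would disappear entirely.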