Python encoding/decoding problems

How do I decode a string such as "weren\xe2\x80\x99t" back to its normal form?

The word is actually weren't, not "weren\xe2\x80\x99t". For example:

print "\xe2\x80\x9cThings"
string = "\xe2\x80\x9cThings"
print string.decode('utf-8')
print string.encode('ascii', 'ignore')

â€œThings
“Things
Things

But I actually want to get "Things.

or:

print "weren\xe2\x80\x99t"
string = "weren\xe2\x80\x99t"
print string.decode('utf-8')
print string.encode('ascii', 'ignore')

werenâ€™t
weren’t
werent

But I actually want to get weren't.

How should I do this?

asked Jan 17 '15 05:01

Brana

1 Answer

I mapped the most common problematic characters, so this is a fairly complete answer, based on Oliver W.'s answer.

This function is by no means ideal, but it is a good place to start. More character definitions are listed here:

http://utf8-chartable.de/unicode-utf8-table.pl?start=8192&number=128&utf8=string
http://www.utf8-chartable.de/unicode-utf8-table.pl?start=128&number=128&names=-&utf8=string-literal
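
Before adding a new entry to the table, it can help to check which character a given byte sequence actually stands for. The following is a minimal sketch in Python 3 syntax (not the Python 2.7 used elsewhere in this answer); the helper name describe is only an illustration.

```python
import unicodedata

def describe(raw):
    """Return (code point, Unicode name) for one UTF-8 encoded character."""
    char = raw.decode('utf-8')  # bytes -> str; must decode to exactly one character
    return 'U+%04X' % ord(char), unicodedata.name(char, 'UNKNOWN')

print(describe(b'\xe2\x80\x99'))  # ('U+2019', 'RIGHT SINGLE QUOTATION MARK')
print(describe(b'\xe2\x80\x9c'))  # ('U+201C', 'LEFT DOUBLE QUOTATION MARK')
```

The name makes it easier to decide which ASCII character is a sensible replacement.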

...

def unicodetoascii(text):
    # text is a UTF-8 encoded byte string (Python 2 str).
    # The table maps Unicode code points to ASCII code points
    # in the form expected by unicode.translate().
    uni2ascii = {
        # single quotation marks
        ord('\xe2\x80\x98'.decode('utf-8')): ord("'"),
        ord('\xe2\x80\x99'.decode('utf-8')): ord("'"),
        ord('\xe2\x80\x9b'.decode('utf-8')): ord("'"),
        # double quotation marks
        ord('\xe2\x80\x9c'.decode('utf-8')): ord('"'),
        ord('\xe2\x80\x9d'.decode('utf-8')): ord('"'),
        ord('\xe2\x80\x9e'.decode('utf-8')): ord('"'),
        ord('\xe2\x80\x9f'.decode('utf-8')): ord('"'),
        # accented letters
        ord('\xc3\xa9'.decode('utf-8')): ord('e'),
        # hyphens and dashes
        ord('\xe2\x80\x90'.decode('utf-8')): ord('-'),
        ord('\xe2\x80\x91'.decode('utf-8')): ord('-'),
        ord('\xe2\x80\x92'.decode('utf-8')): ord('-'),
        ord('\xe2\x80\x93'.decode('utf-8')): ord('-'),
        ord('\xe2\x80\x94'.decode('utf-8')): ord('-'),
        # primes and reversed primes
        ord('\xe2\x80\xb2'.decode('utf-8')): ord("'"),
        ord('\xe2\x80\xb3'.decode('utf-8')): ord("'"),
        ord('\xe2\x80\xb4'.decode('utf-8')): ord("'"),
        ord('\xe2\x80\xb5'.decode('utf-8')): ord("'"),
        ord('\xe2\x80\xb6'.decode('utf-8')): ord("'"),
        ord('\xe2\x80\xb7'.decode('utf-8')): ord("'"),
        # superscript operators and parentheses
        ord('\xe2\x81\xba'.decode('utf-8')): ord("+"),
        ord('\xe2\x81\xbb'.decode('utf-8')): ord("-"),
        ord('\xe2\x81\xbc'.decode('utf-8')): ord("="),
        ord('\xe2\x81\xbd'.decode('utf-8')): ord("("),
        ord('\xe2\x81\xbe'.decode('utf-8')): ord(")"),
    }
    # encode('ascii') raises UnicodeEncodeError if any non-ASCII character
    # is missing from the table.
    return text.decode('utf-8').translate(uni2ascii).encode('ascii')

print unicodetoascii("weren\xe2\x80\x99t")  # prints: weren't
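
The function above relies on Python 2's str.decode. On Python 3, where bytes and str are separate types, the same idea could be sketched roughly as follows. The names PUNCTUATION, TABLE and to_ascii are illustrative, the table covers only part of the mapping above, and the errors parameter is an addition that is not in the original function.

```python
# Python 3 sketch of the same translate-table approach (illustrative names).
PUNCTUATION = {
    '\u2018': "'", '\u2019': "'", '\u201b': "'",                  # single quotes
    '\u201c': '"', '\u201d': '"', '\u201e': '"', '\u201f': '"',  # double quotes
    '\u2010': '-', '\u2011': '-', '\u2012': '-',                  # hyphens
    '\u2013': '-', '\u2014': '-',                                 # dashes
    '\u00e9': 'e',                                                # e with acute
}
TABLE = str.maketrans(PUNCTUATION)

def to_ascii(raw, errors='strict'):
    """Accept UTF-8 bytes or str and return an ASCII-only str.

    With errors='strict', a character missing from the table raises
    UnicodeEncodeError; errors='ignore' silently drops it instead.
    """
    text = raw.decode('utf-8') if isinstance(raw, bytes) else raw
    return text.translate(TABLE).encode('ascii', errors).decode('ascii')

print(to_ascii(b'weren\xe2\x80\x99t'))  # weren't
print(to_ascii(b'\xe2\x80\x9cThings'))  # "Things
```

Keeping errors='strict' as the default makes unmapped characters visible, so the table can be extended instead of losing text silently.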

answered Sep 20 '22 13:09

Brana

Tags: python, encoding, ascii, python-2.7, non-ascii-characters
