Convert UTF-8 to string literals in Python

Tags: python, string, utf-8, literals

I have a string in UTF-8 format, but I'm not sure how to convert it to its corresponding character literal. For example, I have the string:

My string is: 'Entre\xc3\xa9'

Example one:

This code:

u'Entre\xc3\xa9'.encode('latin-1').decode('utf-8')

returns the result: u'Entre\xe9'

If I then continue by printing this:

print u'Entre\xe9'

I get the result: Entreé

This is great and close to what I need. The problem is that I can't put 'Entre\xc3\xa9' in a variable and pass it through the same steps, because that breaks. Any tips for getting this working?

Example:

a = 'Entre\xc3\xa9'
b = 'u'+ a.encode('latin-1').decode('utf-8')
c= 'u'+ b

I would like the result of "c" to be:

Entreé

asked Jul 04 '14 10:07

Tminer

1 Answer

The u'' syntax only works for string literals, that is, for values written directly in source code. Using the syntax creates a unicode object, but it is not the only way to create one.

You cannot make a unicode value from a byte string by adding u in front of it; that only concatenates the letter u onto the data. But if you call str.decode() with the right encoding, you get a unicode value. Conversely, you can encode unicode objects to byte strings with unicode.encode().
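As a minimal sketch of the decode/encode round trip described above: the variable names are illustrative only, and the byte string is written with an explicit b'' prefix so the snippet behaves the same on Python 2 and Python 3 (where a plain '...' literal is already text and has no decode() method).

```python
raw = b'Entre\xc3\xa9'             # UTF-8 encoded data: 7 bytes

text = raw.decode('utf-8')         # unicode text: 6 characters
round_trip = text.encode('utf-8')  # back to the original 7 bytes

# Prepending the letter "u" is ordinary concatenation; it does not
# turn anything into a literal, it just adds one more character.
prefixed = u'u' + text
```

Here text equals u'Entre\xe9', while prefixed is the seven-character string u'uEntre\xe9', which is why the attempt in the question cannot work.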

Note that when displaying a unicode object, Python represents it using the Unicode string literal syntax again (so u'...'), to ease debugging. You can paste the representation back into a Python interpreter and get an object with the same value.
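The point about the representation can be checked with a short sketch. It uses ast.literal_eval() from the standard library as a stand-in for pasting the text back into the interpreter; the variable names are illustrative. The exact repr() text differs by version (Python 2 shows the u prefix and an \xe9 escape, Python 3 shows the accented character directly), but in both cases it evaluates back to an equal value.

```python
import ast

text = b'Entre\xc3\xa9'.decode('utf-8')

# repr() produces literal syntax for the value, not the value itself.
shown = repr(text)

# Evaluating that literal yields an object equal to the original.
rebuilt = ast.literal_eval(shown)
```

The u and the quotes in the displayed form are part of the notation only; they are not characters stored in the string.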

Your a value is defined using a byte string literal, so you only need to decode:

a = 'Entre\xc3\xa9'
b = a.decode('utf8')

Your first example created Mojibake: a Unicode string whose code points are the Latin-1 reading of what are actually UTF-8 bytes. This is why you had to encode to Latin-1 first (to undo the Mojibake and recover the original bytes), then decode from UTF-8.
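The following sketch reproduces that situation under the assumption that the garbled string arose from decoding UTF-8 bytes as Latin-1; the variable names are illustrative. Latin-1 maps every byte to the code point with the same number, so encoding back to Latin-1 recovers the original bytes exactly.

```python
raw = b'Entre\xc3\xa9'            # correct UTF-8 bytes

# Wrong decoding: each byte becomes one code point, so the two bytes
# of the accented letter turn into two separate characters.
garbled = raw.decode('latin-1')

# Repair: Latin-1 encoding restores the bytes, UTF-8 decodes them.
repaired = garbled.encode('latin-1').decode('utf-8')
```

After this, garbled equals u'Entre\xc3\xa9' (seven characters) and repaired equals u'Entre\xe9' (six characters).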

You may want to read up on Python and Unicode in the Unicode HOWTO. Other articles of interest are:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
Pragmatic Unicode by Ned Batchelder

answered Sep 22 '22 22:09

Martijn Pieters
