Python + PostgreSQL + strange ascii = UTF8 encoding error

I have ascii strings which contain the character "\x80" to represent the euro symbol:

>>> print "\x80"
€

When inserting string data containing this character into my database, I get:

psycopg2.DataError: invalid byte sequence for encoding "UTF8": 0x80
HINT:  This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".

I'm a unicode newbie. How can I convert my strings containing "\x80" to valid UTF-8 containing that same euro symbol? I've tried calling .encode and .decode on various strings, but run into errors:

>>> "\x80".encode("utf-8")
Traceback (most recent call last):
  File "<pyshell#14>", line 1, in <module>
    "\x80".encode("utf-8")
UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0: ordinal not in range(128)

953

asked Jun 07 '10 17:06

Claudiu

1 Answer

The question starts with a false premise:

I have ascii strings which contain the character "\x80" to represent the euro symbol.

ASCII characters are in the range "\x00" to "\x7F" inclusive.

The previously accepted, now-deleted answer operated under two gross misapprehensions: (1) that locale == encoding, and (2) that the latin1 encoding maps "\x80" to a Euro character.

In fact, all of the ISO-8859-x encodings map "\x80" to U+0080, which is one of the C1 control characters, not a Euro character. Only 3 of those encodings (x in (7, 15, 16)) provide the Euro character, and they provide it as "\xA4". See this Wikipedia article.
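Both claims about the ISO-8859-x family can be checked against Python's own codec tables. A minimal sketch, assuming Python 3 (byte strings are written b"..." and decoding yields str) and the standard-library codec names iso8859_1 through iso8859_16; the helper decode_byte is a made-up name used only for illustration:

```python
# ISO-8859 parts that have a codec in the standard library (there is no part 12).
PARTS = [n for n in range(1, 17) if n != 12]


def decode_byte(byte, part):
    """Decode a single byte with the given ISO-8859 part (illustrative helper)."""
    return byte.decode("iso8859_%d" % part, "replace")


# Parts where 0x80 decodes to the C1 control character U+0080.
c1_parts = [n for n in PARTS if decode_byte(b"\x80", n) == "\u0080"]

# Parts where 0xA4 decodes to the euro sign U+20AC.
euro_parts = [n for n in PARTS if decode_byte(b"\xa4", n) == "\u20ac"]

print("0x80 -> U+0080 in parts:", c1_parts)
print("0xA4 -> U+20AC in parts:", euro_parts)
```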

You need to know what encoding your data is in. What machine was it created on? How? The locale it was created in (not necessarily yours) may give you a clue.

Note that "My data is encoded in latin1" is up there with "The cheque's in the mail" and "Of course I'll love you in the morning". Your data is probably encoded in one of the cp125x encodings found on Windows platforms. Note that all of them except cp1251 (Windows Cyrillic) map "\x80" to the euro character:

>>> ['\x80'.decode('cp125' + str(x), 'replace') for x in range(9)]
[u'\u20ac', u'\u0402', u'\u20ac', u'\u20ac', u'\u20ac', u'\u20ac', u'\u20ac', u'\u20ac', u'\u20ac']
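Once the source encoding is known, the conversion the question asks for takes two steps: decode the bytes to Unicode using the encoding they were really written in, then encode the result as UTF-8. A sketch, assuming Python 3 syntax and assuming the data really is cp1252; the sample bytes are made up:

```python
# Bytes as they might come out of the file: 0x80 is the euro sign in cp1252.
raw = b"Price: \x80 9.99"

# Step 1: bytes -> Unicode text, using the encoding the data was written in.
text = raw.decode("cp1252")

# Step 2: Unicode text -> UTF-8 bytes, which is what a UTF8 server expects.
utf8_bytes = text.encode("utf-8")

print(repr(text))
print(repr(utf8_bytes))
```

In Python 2, as used in the sessions above, the same two steps would be written '\x80'.decode('cp1252').encode('utf-8').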

Update in response to the OP's comment

I'm reading this data from a file, e.g. open(fname).read(). It contains strings with \x80 in them that represents the euro character. it's just a plain text file. it is generated by another program, but I don't know how it goes about generating the text. what would be a good solution? I'm thinking I can assume that it outputs "\x80" for a euro character, meaning I can assume it's encoded with a cp125x that has that char as the euro.

This is a bit confusing: First you say

It contains strings with \x80 in them that represents the euro character

But later you say

I'm thinking I can assume that it outputs "\x80" for a euro character

Please explain.

Selecting an appropriate cp125x encoding: Where (geographical location) was the file created? In what language(s) is the text written? Are there any characters other than the presumed euro with values > "\x7f"? If so, which ones, and in what context are they used?
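One way to answer the last of those questions is to list every byte above 0x7F in the file, with a little surrounding context. A rough sketch, assuming Python 3; the function name high_byte_report and the sample data are invented for illustration:

```python
from collections import Counter


def high_byte_report(data, context=10):
    """Return (counts, samples) for bytes > 0x7F in a bytes object.

    counts maps each high byte value to how often it occurs;
    samples maps each high byte value to the first snippet it appears in.
    """
    counts = Counter()
    samples = {}
    for pos, value in enumerate(data):  # iterating bytes yields ints in Python 3
        if value > 0x7F:
            counts[value] += 1
            if value not in samples:
                samples[value] = data[max(0, pos - context):pos + context + 1]
    return counts, samples


# Made-up sample; in practice use open(fname, "rb").read().
sample = b"Total: \x80 12.50 (caf\xe9 surcharge \x80 1.00)"
counts, samples = high_byte_report(sample)
for value, count in sorted(counts.items()):
    print("0x%02X x%d  e.g. %r" % (value, count, samples[value]))
```

Seeing which high bytes occur, and next to which words, is usually enough to tell a Western European cp1252 file from, say, a Cyrillic cp1251 one.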

Update 2: If you don't "know how the program is written", neither you nor we can form an opinion on whether it always uses "\x80" for the euro character. Although doing otherwise would be monumental silliness, it can't be ruled out.

If the text is in English, and/or was written in the USA, and/or was produced on a Windows platform, then it's reasonably certain that cp1252 is the way to go ... until you get evidence to the contrary, in which case you'd need to guess an encoding yourself or answer the (what language, what locality) questions.
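In code, "cp1252 until you get evidence to the contrary" can mean decoding strictly, so that contrary evidence shows up as an exception rather than as silent garbage. A sketch, assuming Python 3 and its standard cp1252 codec, which leaves the bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D unassigned; read_text, write_temp and the temporary files are illustrative only:

```python
import io
import os
import tempfile


def read_text(fname, encoding="cp1252"):
    """Read a whole file as text, failing loudly on bytes the encoding cannot map."""
    with io.open(fname, encoding=encoding, errors="strict") as f:
        return f.read()


def write_temp(data):
    """Write bytes to a temporary file and return its path (demo helper)."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    return path


good = write_temp(b"Cost: \x80 5")
bad = write_temp(b"Cost: \x81 5")   # 0x81 is unassigned in cp1252
try:
    text = read_text(good)
    try:
        read_text(bad)
        evidence = None
    except UnicodeDecodeError as exc:
        evidence = exc              # time to revisit the encoding guess
finally:
    os.remove(good)
    os.remove(bad)

print(repr(text))
print(evidence)
```

The decoded text is Unicode, so it can be encoded as UTF-8 before it goes to the database. A clean decode does not prove the guess was right; it only means nothing contradicted it.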

158

answered Sep 28 '22 20:09

John Machin

Tags:

python

postgresql

encoding

unicode

utf-8
