
Python 3.4: str: AttributeError: 'str' object has no attribute 'decode'

I have this code, part of a function that replaces badly encoded foreign characters in a string:

odbc_str = "String from an old database with weird mixed encodings"
s = str(bytes(odbc_str.strip(), 'cp1252'))
s = s.replace('\\x82', 'é')
s = s.replace('\\x8a', 'è')
(...)
print(s)
# b"String from an old database with weird mixed encodings"

I need a "real" string here, not bytes. But when I try to decode it, I get an exception:

odbc_str = "String from an old database with weird mixed encodings"
s = str(bytes(odbc_str.strip(), 'cp1252'))
s = s.replace('\\x82', 'é')
s = s.replace('\\x8a', 'è')
(...)
print(s.decode("utf-8"))
# AttributeError: 'str' object has no attribute 'decode'
  • Do you know why s is bytes here?
  • Why can't I decode it to a real string?
  • Do you know how to do it the clean way? (Today I return s[2:][:-1]. It works but it is very ugly, and I would like to understand this behavior.)

Thanks in advance!

EDIT:

pypyodbc in Python 3 uses Unicode everywhere by default. That confused me. On connect, you can tell it to use ANSI.

con_odbc = pypyodbc.connect("DSN=GP", False, False, 0, False)

Then I can decode the returned data as cp850, which is the original code page of the database.

str(odbc_str, "cp850", "replace")

No more need to manually replace each special character. Thank you very much, pepr.
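
A minimal sketch of what that conversion does, assuming the driver hands back raw bytes. The sample byte string is made up for illustration; it is not real data from the database.

```python
# Hypothetical raw bytes, as an ANSI-mode ODBC driver might return them.
# 0x82 and 0x8a are the cp850 codes for 'é' and 'è'.
raw = b"caf\x82 cr\x8ame"

# str(bytes, encoding, errors) decodes the bytes; it is equivalent to
# raw.decode("cp850", "replace").
text = str(raw, "cp850", "replace")
print(text)
```

With the right code page, the two bytes that were being replaced by hand come out as the intended accented letters.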

asked Sep 24 '14 10:09 by Romu


1 Answer

The printed b"String from an old database with weird mixed encodings" is not the representation of the string content. It is the value of the string content, because you did not pass the encoding argument to str(). See the documentation (https://docs.python.org/3.4/library/stdtypes.html#str):

If neither encoding nor errors is given, str(object) returns object.__str__(), which is the “informal” or nicely printable string representation of object. For string objects, this is the string itself. If object does not have a __str__() method, then str() falls back to returning repr(object).

This is what happened in your case. The b and the quote that follows it are actually two characters that are part of the string content. You can also try:

s1 = 'String from an old database with weird mixed encodings'
print(type(s1), repr(s1))
by = bytes(s1, 'cp1252')
print(type(by), repr(by))
s2 = str(by)
print(type(s2), repr(s2))

and it prints:

<class 'str'> 'String from an old database with weird mixed encodings'
<class 'bytes'> b'String from an old database with weird mixed encodings'
<class 'str'> "b'String from an old database with weird mixed encodings'"

This is the reason why s[2:][:-1] works for you.
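
A cleaner alternative to the slicing, sketched with the same sample text: decode the bytes explicitly instead of calling str() without an encoding. The variable names are only illustrative.

```python
by = bytes('String from an old database with weird mixed encodings', 'cp1252')

# str() without an encoding falls back to repr(by), so the b'...' wrapper
# becomes part of the resulting text.
wrapped = str(by)

# Passing the encoding (or calling bytes.decode) decodes the content instead.
decoded = str(by, 'cp1252')
same = by.decode('cp1252')

print(repr(wrapped))
print(repr(decoded))
print(decoded == same)
```

Either of the last two forms gives the plain string directly, with nothing to strip afterwards.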

If you think about it further, then (in my opinion) you have two options. Either you get bytes or bytearray from the database (if possible) and fix the bytes (see bytes.translate https://docs.python.org/3.4/library/stdtypes.html?highlight=translate#bytes.translate), or you successfully get the string (being lucky that no exception was raised when constructing it) and replace the wrong characters with the correct ones (see also str.translate() https://docs.python.org/3.4/library/stdtypes.html?highlight=translate#str.translate).
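
Both options can be sketched as follows. The input bytes are a made-up example, and the assumption is that the data is mostly cp1252 with cp850 codes where 'é' and 'è' were meant; the real mapping depends on your data.

```python
# Hypothetical mixed-encoding input: 0x82 and 0x8a are cp850 codes for 'é' and 'è'.
raw = b"caf\x82 cr\x8ame"

# Option 1: repair the bytes, then decode.
# 0xe9 and 0xe8 are 'é' and 'è' in cp1252.
byte_table = bytes.maketrans(b"\x82\x8a", b"\xe9\xe8")
fixed_from_bytes = raw.translate(byte_table).decode("cp1252")

# Option 2: decode first, then repair the characters.
# In cp1252, 0x82 decodes to U+201A and 0x8a decodes to U+0160.
char_table = str.maketrans({"\u201a": "é", "\u0160": "è"})
fixed_from_str = raw.decode("cp1252").translate(char_table)

print(fixed_from_bytes)
print(fixed_from_str)
```

A translation table replaces the whole chain of replace() calls with a single pass over the data.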

Possibly, ODBC used the wrong encoding internally. (That is, the content of the database may be correct, but it was misinterpreted by ODBC, and you cannot tell ODBC what the correct encoding is.) In that case, you want to encode the string back to bytes using that wrong encoding, and then decode the bytes using the right encoding.
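
A minimal sketch of that round trip, assuming the data was stored as cp850 and misread as cp1252. Both code pages are assumptions for illustration; substitute the ones that apply to your setup.

```python
# Hypothetical bytes as stored in the database (cp850).
raw = b"caf\x82 cr\x8ame"

# What a driver that wrongly assumes cp1252 would hand back as a string.
misread = raw.decode("cp1252")

# Undo the wrong decoding, then apply the right one.
repaired = misread.encode("cp1252").decode("cp850")

print(repr(misread))
print(repaired)
```

This only works when the wrong decoding lost no information. Python's cp1252 codec, for example, leaves a few byte values (such as 0x81) undefined, so a string built from such data may already contain replacement characters or may never have been constructed at all.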

answered Oct 27 '22 00:10 by pepr