I have a problem with the encoding of the path variable when inserting it into an SQLite database. I tried to solve it with encode("utf-8"), which didn't help. Then I used unicode(), which gives me the type unicode.
print type(path)                   # <type 'unicode'>
path = path.replace("one", "two")  # <type 'str'>
path = path.encode("utf-8")        # <type 'str'> strange
path = unicode(path)               # <type 'unicode'>
In the end I got the unicode type, but I still get the same error that appeared when the path variable was a str:
sqlite3.ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.
Could you help me solve this error and explain the correct usage of the encode("utf-8") and unicode() functions? I'm often fighting with them.
EDIT:
This execute() statement raised the error:
cur.execute("update docs set path = :fullFilePath where path = :path", locals())
I forgot to change the encoding of the fullFilePath variable, which suffers from the same problem, but I'm quite confused now. Should I use only unicode(), only encode("utf-8"), or both?
I can't use
fullFilePath = unicode(fullFilePath.encode("utf-8"))
because it raises this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 32: ordinal not in range(128)
Python version is 2.7.2
Definition. encode() is a built-in Python string method that returns an encoded version of the string in the specified encoding. It converts text to bytes; it does not secure or encrypt the string.
Remarks. If encoding and/or errors are given, unicode() will decode the object which can either be an 8-bit string or a character buffer using the codec for encoding. The encoding parameter is a string giving the name of an encoding; if the encoding is not known, LookupError is raised.
Unicode is a universal character set standard. It defines well over 100,000 characters covering the world's writing systems. ASCII covers only 128 characters, each of which fits in a single byte. Unicode itself does not fix a byte size: a code point is stored using an encoding such as UTF-8 (1 to 4 bytes per character), UTF-16 (2 or 4 bytes), or UTF-32 (always 4 bytes).
Unicode, formally The Unicode Standard, is an information technology standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems.
str is a text representation in bytes, unicode is a text representation in characters.
You decode text from bytes to unicode and encode a unicode into bytes with some encoding.
That is:
>>> 'abc'.decode('utf-8')  # str to unicode
u'abc'
>>> u'abc'.encode('utf-8')  # unicode to str
'abc'
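The UnicodeDecodeError quoted in the question can be reproduced with a small sketch. In Python 2, unicode(s) called without an encoding argument decodes s with the default ASCII codec, and the explicit 'ascii' decode below imitates that. The sample file name is made up, and the code uses only b''/u'' literals, so it should run on both Python 2.7 and Python 3.

```python
# Hypothetical sample value: U+0159 encodes to the two bytes 0xC5 0x99 in UTF-8.
text = u"dvo\u0159\u00e1k.txt"   # text: a sequence of characters
raw = text.encode("utf-8")       # bytes: b'dvo\xc5\x99\xc3\xa1k.txt'

# Decoding with the codec that produced the bytes round-trips cleanly.
assert raw.decode("utf-8") == text

# Decoding the same bytes as ASCII fails on the first byte >= 0x80,
# which is what unicode(raw) does implicitly in Python 2.
try:
    raw.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

print(repr(raw))
print(ascii_error)
```

So calling encode("utf-8") and then unicode() on the result is a round trip with the wrong codec on the way back. If the value is already unicode, neither call is needed; if it is a byte string, a single decode with the right encoding is enough.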
UPD Sep 2020: The answer was written when Python 2 was mostly used. In Python 3, str was renamed to bytes, and unicode was renamed to str.
>>> b'abc'.decode('utf-8')  # bytes to str
'abc'
>>> 'abc'.encode('utf-8')  # str to bytes
b'abc'
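Applied to the update statement from the question, one possible approach is to decode byte paths to text once, at the boundary, and pass only text to execute(). The following is a sketch, not a definitive fix: it assumes an in-memory database, a docs table with a single path column, and UTF-8 encoded input. The helper to_text and the sample paths are hypothetical. It is written for Python 3 but avoids Python 3-only syntax.

```python
import sqlite3


def to_text(value, encoding="utf-8"):
    """Return value as text, decoding only if it is bytes (hypothetical helper)."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table docs (path text)")
cur.execute("insert into docs (path) values (?)", (u"/tmp/one/\u0159.txt",))

# Assumption: both values arrive as UTF-8 encoded bytes.
path = u"/tmp/one/\u0159.txt".encode("utf-8")
fullFilePath = u"/tmp/two/\u0159.txt".encode("utf-8")

# Pass an explicit dict of text values instead of locals().
params = {"path": to_text(path), "fullFilePath": to_text(fullFilePath)}
cur.execute("update docs set path = :fullFilePath where path = :path", params)
conn.commit()

rows = cur.execute("select path from docs").fetchall()
print(rows)
```

Building the parameter dict explicitly, rather than passing locals(), makes it easier to see that every bound value has been converted to text before it reaches SQLite.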