I have an SQLite database that was populated by an external program. I'm trying to read the data with Python. When I attempt to read the data I get the following error:
OperationalError: Could not decode to UTF-8
If I open the database in SQLite Manager and look at the data in the offending record(s) using the built-in browse and search, it looks fine. However, if I export the table as CSV, I notice the character £ in the offending records has become garbled.
If I read the CSV in Python, the £ in the offending records is still garbled, but that's not a problem; I can fix it manually. However, I need to be able to read the data directly from the database, without the intermediate step of converting to CSV.
I have looked at some answers online to similar questions. So far I have tried setting text_factory = str, and I have also tried changing the datatype of the column from TEXT to BLOB using SQLite Manager, but I still get the error.
My code below results in the OperationalError: Could not decode to UTF-8
import sqlite3

conn = sqlite3.connect('test.db')
conn.text_factory = str
curr = conn.cursor()
curr.execute('''SELECT xml_dump FROM hands_1 LIMIT 5000, 5001''')
row = curr.fetchone()
All the records above 5000 in the database have this character problem and hence produce the error.
Any help appreciated.
Python is trying to be helpful by converting pieces of text (stored as bytes in a database) into a python str object for you. In order to do this conversion, python has to guess what letter each byte (or group of bytes) returned by your query represents. The default guess is an encoding called utf-8. Obviously, this guess is wrong in your case.
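The mismatch is easy to reproduce outside of sqlite3. As a minimal illustration (the byte value here is made up; your data may use a different encoding), the single byte 0xA3 is £ in latin-1 but is not a valid utf-8 sequence on its own:

```python
raw = b'\xa3'  # illustrative byte: '£' in latin-1

try:
    raw.decode('utf-8')
except UnicodeDecodeError as e:
    # This is essentially the same failure sqlite3 runs into.
    print('utf-8 failed:', e)

print(raw.decode('latin1'))  # prints £
```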
The solution is to give python a little hint as to how to do the mapping from bytes to letters (i.e., unicode characters). You've already come close with the line
conn.text_factory = str
However (based on your response in the comments above), since you are using python 3, str is the default text factory, so that line will do nothing new for you (see the docs).
What happens behind the scenes with this line is that python tries to convert the bytes returned by the query using the str function, kind of like:
your_string = str(the_bytes, 'utf-8') # actually uses `conn.text_factory`, not `str`
...but you want a different encoding where 'utf-8' is. Since you can't change the default encoding of the str function, you will have to mimic it some other way. You can use a one-off nameless function called a lambda for this:
conn.text_factory = lambda x: str(x, 'latin1')
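Here is a self-contained sketch of the fix, using an in-memory database seeded with latin-1 bytes (the table and column names are made up for the demo; the CAST trick stores raw bytes with TEXT affinity, which reproduces the situation in the question):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE hands (xml_dump TEXT)')
# Store latin-1 bytes as TEXT; fetching this with the default
# text_factory would raise "OperationalError: Could not decode to UTF-8".
conn.execute('INSERT INTO hands VALUES (CAST(? AS TEXT))',
             ('£5'.encode('latin1'),))

conn.text_factory = lambda b: str(b, 'latin1')
row = conn.execute('SELECT xml_dump FROM hands').fetchone()
print(row[0])  # prints £5
```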
Now when the database is handing the bytes to python, python will try to map them to letters using the 'latin1' scheme instead of the 'utf-8' scheme. Of course, I don't know if latin1 is the correct encoding of your data. Realistically, you will have to try a handful of encodings to find the right one. I would try the following first:
'iso-8859-1'
'utf-16'
'utf-32'
'latin1' (note: in Python this is just an alias for 'iso-8859-1', so these two behave identically)
You can find a more complete list here.
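One quick way to hunt for the right encoding is to pull one known-bad value out of the database as raw bytes, decode it with each candidate, and eyeball the results (the sample bytes and candidate list below are illustrative):

```python
sample = b'\xa3500'  # raw bytes pulled from an offending record

for enc in ('utf-8', 'utf-16', 'utf-32', 'latin1'):
    try:
        print(enc, '->', sample.decode(enc))
    except UnicodeDecodeError:
        print(enc, '-> failed')
```

Only one of the candidates that succeeds will produce text that actually looks right; here latin1 yields '£500'.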
Another option is to simply let the bytes coming out of the database remain as bytes. Whether this is a good idea for you depends on your application. You can do it by setting:
conn.text_factory = bytes
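With this setting, no decoding is attempted at all, and you can decode each value yourself later. A sketch, using the same made-up in-memory table as above:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE hands (xml_dump TEXT)')
conn.execute('INSERT INTO hands VALUES (CAST(? AS TEXT))',
             ('£'.encode('latin1'),))

conn.text_factory = bytes
raw = conn.execute('SELECT xml_dump FROM hands').fetchone()[0]
print(raw)                   # b'\xa3' -- raw bytes, no decoding attempted
print(raw.decode('latin1'))  # decode whenever and however you like
```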