Convert or strip out "illegal" Unicode characters

Tags: python, unicode, pymssql

I've got a database in MSSQL that I'm porting to SQLite/Django. I'm using pymssql to connect to the database and save a text field to the local SQLite database.

However, for some characters it explodes. I get complaints like this:

UnicodeDecodeError: 'ascii' codec can't decode byte 0x97 in position 1916: ordinal not in range(128)

Is there some way I can convert the characters to proper Unicode versions? Or strip them out?

asked Mar 24 '10 15:03

Oli

1 Answer

Once you have the string of bytes s, instead of using it as a unicode object directly, convert it explicitly with the right codec, e.g.:

u = s.decode('latin-1')

and use u instead of s in the code that follows this point (presumably the part that writes to sqlite). That's assuming latin-1 is the encoding that was used to make the byte string originally -- it's impossible for us to guess, so try to find out;-).
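To make the decoding step above concrete, here is a minimal sketch. The sample bytes are made up, and cp1252 is only an assumption about what the source database might use (byte 0x97 is a dash in Windows-1252, which is a common reason for it to show up in text exported from Windows systems). The last two lines show the "strip them out" option from the question, using the standard error handlers of decode.

```python
# Hypothetical sample: a byte string such as the one pymssql might return.
raw = b"price \x97 10 EUR"

# latin-1 maps every byte to a code point, so decoding never fails,
# but 0x97 becomes the invisible control character U+0097.
as_latin1 = raw.decode("latin-1")

# cp1252 (an assumption about the source encoding) maps 0x97 to U+2014,
# a printable dash.
as_cp1252 = raw.decode("cp1252")

# If the real encoding cannot be determined, non-ASCII bytes can be
# dropped or replaced instead of raising UnicodeDecodeError.
stripped = raw.decode("ascii", "ignore")    # drops the 0x97 byte
replaced = raw.decode("ascii", "replace")   # inserts U+FFFD in its place
```

Dropping or replacing bytes loses information, so it is probably best kept as a fallback for when the real encoding cannot be found.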

As a general rule, I suggest: don't process in your applications any text as encoded byte strings -- decode them to unicode objects right after input, and, if necessary, encode them back to byte strings right before output.
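The general rule above could be sketched roughly like this. The helper names to_text and to_bytes, the SOURCE_ENCODING constant, and the sample row are all hypothetical and not part of pymssql or Django; the encoding is again an assumption to be replaced with whatever the source database really uses.

```python
# Assumption: the encoding used by the source database.
SOURCE_ENCODING = "cp1252"


def to_text(value, encoding=SOURCE_ENCODING):
    """Decode byte strings at the input boundary; leave other values alone."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def to_bytes(text, encoding="utf-8"):
    """Encode text only at the output boundary, if bytes are required."""
    return text.encode(encoding)


# Hypothetical row as it might arrive from the source database.
row = (1, b"caf\xe9 \x97 bar")

# Decode once, right after input; everything downstream sees only text.
clean = tuple(to_text(value) for value in row)

# Encode again only if the destination really needs bytes.
encoded = to_bytes(clean[1])
```

With this shape, the code that writes to SQLite only ever receives unicode objects, so the implicit ASCII decoding that caused the original error should never be triggered.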

answered Sep 28 '22 22:09

Alex Martelli
