Unicode Problem with SQLAlchemy

I know I'm having a problem with a conversion from Unicode but I'm not sure where it's happening.

I'm extracting data about a recent European trip from a directory of HTML files. Some of the location names have non-ASCII characters (such as é, ô, ü). I'm getting the data from a string representation of the file using regex.

If I print the locations as I find them, they print with the correct characters, so the encoding must be OK:

Le Pré-Saint-Gervais, France
Hôtel-de-Ville, France

I'm storing the data in a SQLite table using SQLAlchemy:

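from sqlalchemy import create_engine, Column, Integer, Float, String, Unicode, Date, Time
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()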
class Point(Base):
    __tablename__ = 'points'

    id = Column(Integer, primary_key=True)
    pdate = Column(Date)
    ptime = Column(Time)
    location = Column(Unicode(32))
    weather = Column(String(16))
    high = Column(Float)
    low = Column(Float)
    lat = Column(String(16))
    lon = Column(String(16))
    image = Column(String(64))
    caption = Column(String(64))

    def __init__(self, filename, pdate, ptime, location, weather, high, low, lat, lon, image, caption):
        self.filename = filename
        self.pdate = pdate
        self.ptime = ptime
        self.location = location
        self.weather = weather
        self.high = high
        self.low = low
        self.lat = lat
        self.lon = lon
        self.image = image
        self.caption = caption

    def __repr__(self):
        return "<Point('%s','%s','%s')>" % (self.filename, self.pdate, self.ptime)

engine = create_engine('sqlite:///:memory:', echo=False)
Base.metadata.create_all(engine)
Session = sessionmaker(bind = engine)
session = Session()

I loop through the files and insert the data from each one into the database:

for filename in filelist:

    # open the file and extract the information using regex such as:
    location_re = re.compile("<h2>(.*)</h2>",re.M)
    # extract other data

    newpoint = Point(filename, pdate, ptime, location, weather, high, low, lat, lon, image, caption)
    session.add(newpoint)
    session.commit()

I see the following warning on each insert:

/usr/lib/python2.5/site-packages/SQLAlchemy-0.5.4p2-py2.5.egg/sqlalchemy/engine/default.py:230: SAWarning: Unicode type received non-unicode bind param value 'Spitalfields, United Kingdom'
  param.append(processors[key](compiled_params[key]))

And when I try to do anything with the table such as:

session.query(Point).all()

I get:

Traceback (most recent call last):
  File "./extract_trips.py", line 131, in <module>
    session.query(Point).all()
  File "/usr/lib/python2.5/site-packages/SQLAlchemy-0.5.4p2-py2.5.egg/sqlalchemy/orm/query.py", line 1193, in all
    return list(self)
  File "/usr/lib/python2.5/site-packages/SQLAlchemy-0.5.4p2-py2.5.egg/sqlalchemy/orm/query.py", line 1341, in instances
    fetch = cursor.fetchall()
  File "/usr/lib/python2.5/site-packages/SQLAlchemy-0.5.4p2-py2.5.egg/sqlalchemy/engine/base.py", line 1642, in fetchall
    self.connection._handle_dbapi_exception(e, None, None, self.cursor, self.context)
  File "/usr/lib/python2.5/site-packages/SQLAlchemy-0.5.4p2-py2.5.egg/sqlalchemy/engine/base.py", line 931, in _handle_dbapi_exception
    raise exc.DBAPIError.instance(statement, parameters, e, connection_invalidated=is_disconnect)
sqlalchemy.exc.OperationalError: (OperationalError) Could not decode to UTF-8 column 'points_location' with text 'Le Pré-Saint-Gervais, France' None None

I would like to be able to correctly store and then return the location names with the original characters intact. Any help would be much appreciated.

asked Jun 08 '09 18:06

Dave Forgac

3 Answers

I found this article that helped explain my troubles somewhat:

http://www.amk.ca/python/howto/unicode#reading-and-writing-unicode-data

I was able to get the desired results by using the 'codecs' module and then changing my program as follows:

When opening the file:

infile = codecs.open(filename, 'r', encoding='iso-8859-1')

When printing the location:

print location.encode('ISO-8859-1')

I can now query and manipulate the data from the table without the error from before. I just have to specify the encoding when I output the text.
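
For anyone following along, here is a minimal sketch of the "decode on the way in, encode on the way out" idea described above. It is written for Python 3 (where str is already Unicode), not the Python 2.5 used in the question, and it uses io.open in place of codecs.open; both decode while reading. The sample HTML, the read_location helper, and the assumption that the files are ISO-8859-1 are all made up for illustration.

```python
import io
import os
import re
import tempfile

# Hypothetical sample page; the real files and their encoding are assumptions.
SAMPLE_HTML = u"<html><h2>Le Pr\u00e9-Saint-Gervais, France</h2></html>"


def read_location(path, encoding="iso-8859-1"):
    """Decode the file while reading, then pull the <h2> text out with a regex."""
    with io.open(path, "r", encoding=encoding) as infile:
        text = infile.read()  # Unicode text, not raw bytes
    match = re.search(u"<h2>(.*)</h2>", text)
    return match.group(1) if match else None


# Write the sample as ISO-8859-1 bytes, the way the source files are assumed to be stored.
handle, sample_path = tempfile.mkstemp(suffix=".html")
os.write(handle, SAMPLE_HTML.encode("iso-8859-1"))
os.close(handle)

location = read_location(sample_path)
os.remove(sample_path)

# Encode only at the output boundary, as in the print statement above.
location_bytes = location.encode("iso-8859-1")
```

The point of the sketch is that the regex and everything after it only ever see decoded text; bytes exist only at the file and at the output.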

(I still don't entirely understand how this is working so I guess it's time to learn more about Python's unicode handling...)

answered Oct 14 '22 22:10

Dave Forgac

From sqlalchemy.org

See section 0.4.2

Added new flag to String and create_engine(), assert_unicode=(True|False|'warn'|None). Defaults to False or None on create_engine() and String, 'warn' on the Unicode type. When True, results in all unicode conversion operations raising an exception when a non-unicode bytestring is passed as a bind parameter. 'warn' results in a warning. It is strongly advised that all unicode-aware applications make proper use of Python unicode objects (i.e. u'hello' and not 'hello') so that data round trips accurately.

I think you are trying to input a non-unicode bytestring. Perhaps this might lead you on the right track? Some form of conversion is needed, compare 'hello' and u'hello'.
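
To make the 'hello' versus u'hello' distinction concrete, here is a small sketch. It assumes the bytes are ISO-8859-1, and is_unicode is a hypothetical helper, not part of SQLAlchemy. In Python 3 the byte string is spelled b'...' and plain '...' is already Unicode.

```python
# Bytes as they might come out of a file opened without an encoding
# (assumed to be ISO-8859-1; that encoding is a guess about the data).
raw = b"H\xf4tel-de-Ville, France"

# Decoding turns the bytes into a Unicode string, which is the kind of
# value a Unicode column expects as a bind parameter.
text = raw.decode("iso-8859-1")


def is_unicode(value):
    """Hypothetical helper: True for Unicode text, False for byte strings."""
    return isinstance(value, type(u""))


checks = (is_unicode(raw), is_unicode(text))
```

Only the decoded value passes the check, which is roughly what the SAWarning in the question is complaining about.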

Cheers

answered Oct 14 '22 22:10

ralphtheninja

Try using a column type of Unicode rather than String for the unicode columns:

Base = declarative_base()
class Point(Base):
    __tablename__ = 'points'

    id = Column(Integer, primary_key=True)
    pdate = Column(Date)
    ptime = Column(Time)
    location = Column(Unicode(32))
    weather = Column(String(16))
    high = Column(Float)
    low = Column(Float)
    lat = Column(String(16))
    lon = Column(String(16))
    image = Column(String(64))
    caption = Column(String(64))

Edit: Response to comment:

If you're getting warnings about unicode encodings then there are two things you can try:

Convert your location to unicode. This would mean having your Point created like this:

newpoint = Point(filename, pdate, ptime, unicode(location), weather, high, low, lat, lon, image, caption)

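The unicode conversion returns a unicode string whether it is passed a byte string or a unicode string. Note that for a byte string it decodes with the default ASCII codec, so non-ASCII bytes need an explicit encoding, for example unicode(location, 'iso-8859-1').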
If that doesn't solve the encoding issues, try calling encode on your unicode objects. That would mean using code like:

newpoint = Point(filename, pdate, ptime, unicode(location).encode('utf-8'), weather, high, low, lat, lon, image, caption)

This step probably won't be necessary, but what it essentially does is convert a unicode object from unicode code points to a specific byte representation (in this case, UTF-8). I'd expect SQLAlchemy to do this for you when you pass in unicode objects, but it may not.
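
As a rough illustration of the first suggestion, here is a sketch of a round trip where the value is decoded before it is stored. It uses the standard-library sqlite3 module directly so that it is self-contained, not SQLAlchemy, and it is written for Python 3. The two-column table and the ISO-8859-1 source encoding are assumptions for the example.

```python
import sqlite3

# Assumed ISO-8859-1 bytes, as they might be read from one of the HTML files.
raw = b"Le Pr\xe9-Saint-Gervais, France"

# Decode once, before the value reaches the database layer.
location = raw.decode("iso-8859-1")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (id INTEGER PRIMARY KEY, location TEXT)")
conn.execute("INSERT INTO points (location) VALUES (?)", (location,))
conn.commit()

# Read the value back; the driver returns it as Unicode text.
stored = conn.execute("SELECT location FROM points").fetchone()[0]
conn.close()
```

With the value decoded up front, the text read back compares equal to what was inserted, accented characters included.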

answered Oct 14 '22 22:10

workmad3

Tags:

python

character-encoding

encoding

unicode

sqlalchemy
