I am trying to parse an RSS feed with feedparser and insert it into a MySQL table using SQLAlchemy. This was working fine, but today the feed had an item with an ellipsis character in the description, and I got the following error:
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2026' in position 35: ordinal not in range(256)
If I add the convert_unicode=True option to the engine, the insert goes through, but the ellipsis doesn't show up; I just get weird characters. This seems to make sense since, to the best of my knowledge, there is no horizontal ellipsis in latin-1. Even setting the encoding to utf-8 doesn't seem to make a difference. If I do an insert using phpMyAdmin and include the ellipsis, it goes through fine.
I'm thinking I just don't understand character encodings or how to get SQLAlchemy to use one I specify. Does anyone know how to get the text to go in without weird characters?
UPDATE
I think I have figured this one out but I'm not really sure why it matters...
Here is the code:
import sys
import feedparser
import sqlalchemy
from sqlalchemy import create_engine, MetaData, Table

COMMON_CHANNEL_PROPERTIES = [
    ('Channel title:', 'title', None),
    ('Channel description:', 'description', 100),
    ('Channel URL:', 'link', None),
]

COMMON_ITEM_PROPERTIES = [
    ('Item title:', 'title', None),
    ('Item description:', 'description', 100),
    ('Item URL:', 'link', None),
]

INDENT = u' '*4

def feedinfo(url, output=sys.stdout):
    feed_data = feedparser.parse(url)
    channel, items = feed_data.feed, feed_data.entries
    # adding charset=utf8 here is what fixed the problem
    db = create_engine('mysql://user:pass@localhost/db?charset=utf8')
    metadata = MetaData(db)
    rssItems = Table('rss_items', metadata, autoload=True)
    i = rssItems.insert()
    for label, prop, trunc in COMMON_CHANNEL_PROPERTIES:
        value = channel[prop]
        if trunc:
            value = value[:trunc] + u'...'
        print >> output, label, value
    print >> output
    print >> output, "Feed items:"
    for item in items:
        i.execute({'title': item['title'], 'description': item['description'][:100]})
        for label, prop, trunc in COMMON_ITEM_PROPERTIES:
            value = item[prop]
            if trunc:
                value = value[:trunc] + u'...'
            print >> output, INDENT, label, value
        print >> output, INDENT, u'---'
    return

if __name__ == "__main__":
    url = sys.argv[1]
    feedinfo(url)
Here's the output/traceback from running the code without the charset option:
Channel title: [H]ardOCP News/Article Feed
Channel description: News/Article Feed for [H]ardOCP...
Channel URL: http://www.hardocp.com
Feed items:
Item title: Windows 8 UI is Dropping the 'Start' Button
Item description: After 15 years of occupying a place of honor on the desktop, the "Start" button will disappear from ...
Item URL: http://www.hardocp.com/news/2012/02/05/windows_8_ui_dropping_lsquostartrsquo_button/
---
Item title: Which Crashes More? Apple Apps or Android Apps
Item description: A new study of smartphone apps between Android and Apple conducted over a two month period came up w...
Item URL: http://www.hardocp.com/news/2012/02/05/which_crashes_more63_apple_apps_or_android/
---
Traceback (most recent call last):
  File "parse.py", line 47, in <module>
    feedinfo(url)
  File "parse.py", line 36, in feedinfo
    i.execute({'title':item['title'], 'description': item['description'][:100]})
  File "/usr/local/lib/python2.7/site-packages/sqlalchemy/sql/expression.py", line 2758, in execute
    return e._execute_clauseelement(self, multiparams, params)
  File "/usr/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 2304, in _execute_clauseelement
    return connection._execute_clauseelement(elem, multiparams, params)
  File "/usr/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1538, in _execute_clauseelement
    compiled_sql, distilled_params
  File "/usr/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1639, in _execute_context
    context)
  File "/usr/local/lib/python2.7/site-packages/sqlalchemy/engine/default.py", line 330, in do_execute
    cursor.execute(statement, parameters)
  File "build/bdist.linux-i686/egg/MySQLdb/cursors.py", line 159, in execute
  File "build/bdist.linux-i686/egg/MySQLdb/connections.py", line 264, in literal
  File "build/bdist.linux-i686/egg/MySQLdb/connections.py", line 202, in unicode_literal
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2026' in position 35: ordinal not in range(256)
So it looks like adding the charset to the MySQL connection string did it. I suppose it defaults to latin-1? I had tried setting the encoding flag on create_engine to utf8 and that did nothing. Does anyone know why it would use latin-1 when the tables and fields are set to utf8 unicode? I also tried encoding item['description'] using .encode('cp1252') before sending it off, and that worked as well, even without adding the charset option to the connection string. That shouldn't have worked with latin-1, but apparently it did? I've got the solution but would love an answer :)
The error message

UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2026' in position 35: ordinal not in range(256)

seems to indicate that some Python language code is trying to convert the character \u2026 into a Latin-1 (ISO8859-1) string, and it is failing. Not surprising: that character is U+2026 HORIZONTAL ELLIPSIS, which has no single equivalent character in ISO8859-1.
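The failure can be reproduced without a database at all. The following is a minimal sketch in Python 3 syntax (the question's code is Python 2, where the literal would be written u'\u2026'); the helper name try_encode is made up for this illustration.

```python
# Python 3 sketch; try_encode is a hypothetical helper, not part of any library.
ELLIPSIS = "\u2026"  # U+2026 HORIZONTAL ELLIPSIS


def try_encode(text, codec):
    """Return the encoded bytes, or the error message if encoding fails."""
    try:
        return text.encode(codec)
    except UnicodeEncodeError as exc:
        return str(exc)


# Put the ellipsis at index 35, as in the reported traceback.
result = try_encode("x" * 35 + ELLIPSIS, "latin-1")
print(result)
```

The printed message should closely mirror the one in the traceback, since Latin-1 only covers code points below 256.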
You fixed the problem by adding the query ?charset=utf8 in your SQLAlchemy connection call:
import sqlalchemy
from sqlalchemy import create_engine, MetaData, Table
db = create_engine('mysql://user:pass@localhost/db?charset=utf8')
The section Database Urls of the SQLAlchemy documentation tells us that a URL beginning with mysql indicates a MySQL dialect, using the mysql-python driver.
The following section, Custom DBAPI connect() arguments, tells us that query arguments are passed to the underlying DBAPI.
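To make the idea of "query arguments" concrete, here is a rough sketch that pulls them out of the URL with the Python 3 standard library. This is only an illustration of what the URL contains; it is not how SQLAlchemy parses URLs internally, and the variable name connect_kwargs is an assumption of this example.

```python
from urllib.parse import urlsplit, parse_qsl

# Same URL as in the question.
url = "mysql://user:pass@localhost/db?charset=utf8"

parts = urlsplit(url)
# Everything after '?' becomes a dict of keyword-style arguments.
connect_kwargs = dict(parse_qsl(parts.query))

print(parts.scheme)     # the dialect name
print(connect_kwargs)   # the arguments that would be handed on to the driver
```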
So, what does the mysql-python driver make of a parameter {charset: 'utf8'}? Section Functions and attributes of their documentation says of the charset attribute "...If present, the connection character set will be changed to this character set, if they are not equal."
To find out what the connection character set means, we turn to 10.1.4. Connection Character Sets and Collations of the MySQL 5.6 reference manual. To make a long story short, MySQL can interpret incoming queries in an encoding different from the database's character set, and different from the encoding of the returned query results.
Since the error message you reported looks like a Python rather than a SQL error message, I'll speculate that something in SQLAlchemy or mysql-python is attempting to convert the query to a default connection encoding of latin-1 before sending it. This is what triggers the error. However, the query string ?charset=utf8 in your create_engine() call changes the connection encoding, and the U+2026 HORIZONTAL ELLIPSIS is able to get through.
Update: you also ask, "if I remove the charset option and then encode the description using .encode('cp1252') it will go through just fine. How is an ellipsis able to get through with cp1252 but not unicode?"
The encoding cp1252 has a horizontal ellipsis character at byte value \x85. Thus it is possible to encode a Unicode string containing U+2026 HORIZONTAL ELLIPSIS into cp1252 without error.
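A quick way to check this is to encode the character yourself. The sketch below uses Python 3 syntax (in Python 2 the literal would be u'\u2026') and also shows the UTF-8 form for comparison.

```python
ELLIPSIS = "\u2026"  # U+2026 HORIZONTAL ELLIPSIS

# cp1252 has a slot for the ellipsis, so this yields a single byte.
cp1252_bytes = ELLIPSIS.encode("cp1252")

# UTF-8 can represent any code point; the ellipsis takes three bytes.
utf8_bytes = ELLIPSIS.encode("utf-8")

print(cp1252_bytes)  # b'\x85'
print(utf8_bytes)    # b'\xe2\x80\xa6'
```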
Remember also that in Python, Unicode strings and byte strings are two different data types. It's reasonable to speculate that MySQLdb might have a policy of sending only byte strings over a SQL connection. Thus it would encode a query received as a Unicode string into a byte string, but would leave a query received as a byte string alone. (This is speculation, I haven't looked at the source code.)
In the traceback you posted, the last two lines (closest to where the error occurs) show the method names literal, followed by unicode_literal. That tends to support the theory that MySQLdb is encoding the query it receives as a Unicode string into a byte string.
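The speculated policy could look something like the sketch below. To be clear, this is not MySQLdb's source code: the function to_wire and its default charset are invented here purely to illustrate the theory, and the code is written in Python 3 syntax.

```python
def to_wire(value, connection_charset="latin-1"):
    """Hypothetical stand-in for the driver's literal-conversion step.

    Text is encoded with the connection charset; byte strings are
    assumed to be already encoded and are passed through untouched.
    """
    if isinstance(value, str):
        return value.encode(connection_charset)
    return value


# Text plus a utf8 connection charset: encodes fine.
as_utf8 = to_wire("\u2026", "utf8")

# Pre-encoded bytes: nothing is encoded, so nothing can fail here.
passed_through = to_wire("\u2026".encode("cp1252"))

# Text plus the assumed latin-1 default: raises UnicodeEncodeError.
try:
    to_wire("\u2026")
    failed = False
except UnicodeEncodeError:
    failed = True
```

Under this theory, the three cases correspond to adding ?charset=utf8, calling .encode('cp1252') yourself, and the original failing insert.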
When you encode the query string yourself, you bypass the part of MySQLdb that does this encoding differently. Note, however, that if you encode the query string differently than the MySQL connection charset calls for, then you'll have an encoding mismatch, and your text will likely be stored wrong.
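The mismatch described above is easy to simulate by encoding with one charset and decoding with another. This is a sketch in Python 3 syntax that only mimics the effect in memory; what a real MySQL server stores depends on its connection and column settings.

```python
ELLIPSIS = "\u2026"

# Bytes produced with cp1252 but interpreted as latin-1:
# byte 0x85 becomes an invisible control character, not an ellipsis.
cp1252_read_as_latin1 = ELLIPSIS.encode("cp1252").decode("latin-1")

# Bytes produced with UTF-8 but interpreted as latin-1:
# the three bytes turn into three unrelated characters.
utf8_read_as_latin1 = ELLIPSIS.encode("utf-8").decode("latin-1")

print(repr(cp1252_read_as_latin1))
print(repr(utf8_read_as_latin1))
```

Either result would plausibly show up as the "weird characters" mentioned in the question.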