I'm in the process of converting Python code over to the new SQLAlchemy-based SQL functions in pandas 0.14.1.
A common pattern we use is (generically):
connection = db.connect()  # open connection/session
sql = 'CREATE TEMP TABLE table1 AS SELECT ...'
connection.execute(sql)
# ... other SQL that creates TEMP tables from various joins of previous TEMP tables ...
sql = 'CREATE TEMP TABLE tableN AS SELECT ...'
connection.execute(sql)
result = connection.query('SELECT * FROM tableN WHERE ...')
connection.close()  # the server drops the TEMP tables with the session
Now, once the connection is closed the TEMP tables are purged by the DB server. However, as the final select query is using the same connection/session, it can access the tables.
How can I achieve something similar using SQLAlchemy and pd.read_sql_query()?
For example:
engine = sqlalchemy.create_engine('netezza://@mydsn')
connection = engine.connect()
sql = 'CREATE TEMP TABLE tmptable AS SELECT ...'
connection.execute(sql)
result = pd.read_sql_query('SELECT * FROM tmptable WHERE ...', engine)
yields a DB error saying the TEMP table tmptable doesn't exist. Presumably this is because passing the engine to read_sql_query() requires it to open a new connection, which has an independent session scope and hence can't see the TEMP table. Is that a reasonable assumption?
Is there a way to work around that? (passing the connection to read_sql_query() isn't supported)
(I know that I can concatenate the SQL into a single string with ; separating the statements, but this is a simplification of the actual situation: the TEMP tables are created by a multitude of functions that call each other, nesting 3-4 deep. Achieving that would require a layer that can coalesce the SQL across multiple calls before issuing it, which I'd rather avoid implementing if there is a nicer way.)
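The closest workaround I can see is dropping down to the raw DBAPI connection underneath the SQLAlchemy connection, since read_sql_query() accepts a DBAPI2 object in fallback mode, though the docs only promise sqlite3 support there. An untested sketch, continuing from the snippet above:
# Untested: connection.connection exposes the raw DBAPI (pyodbc)
# connection, so the query would run in the same session that created
# the TEMP tables. pandas treats non-sqlite3 DBAPI2 objects as
# unsupported fallback mode, so no guarantees.
raw_connection = connection.connection
result = pd.read_sql_query('SELECT * FROM tmptable WHERE ...', raw_connection)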
Using:
Pandas: 0.14.1
sqlalchemy: 0.9.7
pyodbc: 3.0.6
Win7 x86_64 Canopy Python distribution (Python 2.7.6)
Josh Kuhn's Netezza SQLAlchemy dialect from https://github.com/deontologician/netezza_sqlalchemy
All you need to do (on SQL Server, since SET NOCOUNT ON is T-SQL) is add SET NOCOUNT ON at the beginning of your batch; it suppresses the "rows affected" messages, so pandas read_sql will read everything as one statement:
sql = '''SET NOCOUNT ON
CREATE TABLE ... '''
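For instance, a minimal self-contained sketch; the mssql+pyodbc DSN and the #tmp table are made up for illustration:
import pandas as pd
import sqlalchemy

# Hypothetical SQL Server engine; the DSN name is illustrative.
engine = sqlalchemy.create_engine('mssql+pyodbc://@mydsn')

# SET NOCOUNT ON suppresses the "N rows affected" messages from the
# intermediate statements, so the driver surfaces only the final
# SELECT's result set and pandas reads the batch as a single statement.
sql = '''SET NOCOUNT ON
CREATE TABLE #tmp (id INT)
INSERT INTO #tmp VALUES (1), (2)
SELECT * FROM #tmp WHERE id > 1'''

df = pd.read_sql(sql, engine)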
You can now pass a SQLAlchemy connectable to pandas.read_sql. From the docs:
pandas.read_sql(sql, con, index_col=None, coerce_float=True, params=None, parse_dates=None, columns=None, chunksize=None)
...
con : SQLAlchemy connectable (engine/connection) or database string URI
or DBAPI2 connection (fallback mode)
Using SQLAlchemy makes it possible to use any DB supported by that library. If a DBAPI2 object, only sqlite3 is supported.
So, this should work:
engine = sqlalchemy.create_engine('netezza://@mydsn')
connection = engine.connect()
sql = 'CREATE TEMP TABLE tmptable AS SELECT ...'
connection.execute(sql)
result = pd.read_sql('SELECT * FROM tmptable WHERE ...', con=connection)
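And if you want the TEMP tables' lifetime to be explicit, a context manager ties it to the connection (the same pattern as above, untested against Netezza):
import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine('netezza://@mydsn')

# The TEMP tables live exactly as long as this connection/session.
with engine.connect() as connection:
    connection.execute('CREATE TEMP TABLE tmptable AS SELECT ...')
    result = pd.read_sql('SELECT * FROM tmptable WHERE ...', con=connection)
# Leaving the block closes the connection, and the server drops tmptable.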