Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can pandas.read_sql_query() query a TEMP table?

I'm in the process of converting Python code over to the new SQLAlchemy-based Pandas 0.14.1.

A common pattern we use is (generically):

connection = db.connect()  # open connection/session

sql = 'CREATE TEMP TABLE table1 AS SELECT ...'
connection.execute(sql)

... other sql that creates TEMP tables from various joins of previous TEMP tables ...

sql = 'CREATE TEMP TABLE tableN AS SELECT ...'
connection.execute(sql)

result = connection.query('SELECT * FROM tableN WHERE ...')

connection.close()

Now, once the connection is closed the TEMP tables are purged by the DB server. However, as the final select query is using the same connection/session, it can access the tables.

How can I achieve similar using SQLAlchemy and pd.read_sql_query() ?

For example:

engine = sqlalchemy.create_engine('netezza://@mydsn')
connection = engine.connect()

sql = 'CREATE TEMP TABLE tmptable AS SELECT ...'
connection.execute(sql)

result = pd.read_sql_query('SELECT * FROM tmptable WHERE ...', engine)

yields a DB error that the TEMP table tmptable doesn't exist. Presumably this is because passing the engine to the read_sql_query() requires it to open a new connection which has an independent session scope and hence can't see the TEMP table. Is that a reasonable assumption?

Is there a way to work around that? (passing the connection to read_sql_query() isn't supported)

(I know that I can concatenate the SQL into a single string with ; separating the statements, but this is a simplification of the actual situation where the TEMP tables are created by a multitude of functions which call others nesting 3-4 deep. So, to achieve that would require implementing a layer than can coalesce the SQL across multiple calls before issuing it, which I'd rather avoid implementing if there is a nicer way)

Using -
Pandas: 0.14.1
sqlalchemy: 0.9.7
pyodbc: 3.0.6
Win7 x86_64 Canopy Python distribution (Python 2.7.6)
Josh Kuhn's Netezza SQLAlchemy dialect from https://github.com/deontologician/netezza_sqlalchemy

like image 852
DavidJ Avatar asked Oct 09 '14 19:10

DavidJ


People also ask

What is Read_sql_query?

pandas read_sql() function is used to read SQL query or database table into DataFrame.

Is Panda faster than SQL?

pandas scales with the data, up to just under 0.5 seconds for 10 million records) filter data (>10x-50x faster with sqlite . The difference is more pronounced as data grows in size) sort by single column: pandas is always a bit slower, but this was the closest.


2 Answers

All you need to do is add 'SET NOCOUNT ON' at the beginning of your query, that way pandas read_sql will read everything as one statement.

sSQL = '''SET NOCOUNT ON
CREATE TABLE ...... '''
like image 152
Carlos Chaccon Avatar answered Sep 28 '22 00:09

Carlos Chaccon


You can now pass SQLAlchemy connectable to pandas.read_sql. From the docs:

pandas.read_sql(sql, con, index_col=None, coerce_float=True, params=None, parse_dates=None, columns=None, chunksize=None)

...

con : SQLAlchemy connectable (engine/connection) or database string URI

or DBAPI2 connection (fallback mode)

Using SQLAlchemy makes it possible to use any DB supported by that > library. If a DBAPI2 object, only sqlite3 is supported.

So, this should work:

engine = sqlalchemy.create_engine('netezza://@mydsn')
connection = engine.connect()

sql = 'CREATE TEMP TABLE tmptable AS SELECT ...'
connection.execute(sql)

result = pd.read_sql('SELECT * FROM tmptable WHERE ...', con=connection)
like image 42
ssharma Avatar answered Sep 28 '22 00:09

ssharma