I am currently querying data into a dataframe via the pandas.io.sql.read_sql() command. I wanted to parallelize the calls similar to what this guy is advocating: (Embarrassingly parallel database calls with Python (PyData Paris 2015))
Something like (very general):
from psycopg2.pool import ThreadedConnectionPool
from parallel_connection import ParallelConnection  # from the talk's repo

pools = [ThreadedConnectionPool(1, 20, dsn=d) for d in dsns]  # one pool per database
connections = [pool.getconn() for pool in pools]              # one connection per pool
parallel_connection = ParallelConnection(connections)
pandas_cursor = parallel_connection.cursor()
pandas_cursor.execute(my_query)
Is something like that possible?
Yes, this should work, although with the caveat that you'll need to change parallel_connection.py from the talk that you cite. In that code there's a fetchall
function which executes each of the cursors in parallel, then combines the results. This is the core of what you'll change:
Old Code:
def fetchall(self):
    results = [None] * len(self.cursors)
    def do_work(index, cursor):
        results[index] = cursor.fetchall()
    self._do_parallel(do_work)
    return list(chain(*[rs for rs in results]))
New Code:
def fetchall(self):
    results = [None] * len(self.sql_connections)
    def do_work(index, sql_connection):
        sql, conn = sql_connection  # store a tuple of sql/conn instead of a cursor
        results[index] = pd.read_sql(sql, conn)
    self._do_parallel(do_work)
    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    return pd.concat(results, ignore_index=True)
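For completeness, here is one hypothetical way the modified class could be driven end to end. This is only a sketch: the sql_connections constructor argument is an assumption about how you would adapt the repo's class, not part of its actual API, and dsns/my_query are the placeholders from the question.

import pandas as pd
from psycopg2.pool import ThreadedConnectionPool
from parallel_connection import ParallelConnection  # modified as above

pools = [ThreadedConnectionPool(1, 20, dsn=d) for d in dsns]
connections = [pool.getconn() for pool in pools]
# One (sql, conn) pair per shard, matching the modified fetchall
sql_connections = [(my_query, conn) for conn in connections]
parallel_connection = ParallelConnection(sql_connections)
df = parallel_connection.fetchall()  # one DataFrame with every shard's rows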
Repo: https://github.com/godatadriven/ParallelConnection
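If you would rather not patch the repo at all, the same fan-out-then-concat pattern can be written directly with the standard library. This is a minimal self-contained sketch, not the talk's code: read_shard and read_parallel are made-up names, and dsns/sql are placeholders for your shard DSNs and query.

import pandas as pd
import psycopg2
from concurrent.futures import ThreadPoolExecutor

def read_shard(dsn, sql):
    # One connection per thread; psycopg2 releases the GIL while it
    # waits on the server, so the shard queries genuinely overlap.
    conn = psycopg2.connect(dsn)
    try:
        return pd.read_sql(sql, conn)
    finally:
        conn.close()

def read_parallel(dsns, sql):
    # Fan the same query out to every shard, then stack the results.
    with ThreadPoolExecutor(max_workers=len(dsns)) as pool:
        frames = list(pool.map(lambda d: read_shard(d, sql), dsns))
    return pd.concat(frames, ignore_index=True)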