I have a function in python that scores a row in my table. I would like to combine the scores of all the rows arithmetically (eg. computing the sum, average, etc.. of the scores).
def compute_score(row):
# some complicated python code that would be painful to convert into SQL-equivalent
return score
The obvious first approach is to simply read in all the data
import psycopg2
def sum_scores(dbname, tablename):
conn = psycopg2.connect(dbname)
cur = conn.cursor()
cur.execute('SELECT * FROM ?', tablename)
rows = cur.fetchall()
sum = 0
for row in rows:
sum += score(row)
conn.close()
return sum
I would like to be able to handle as much data as my database can hold. This could be larger that what would fit into Python's memory, so fetchall()
seems to me like it would not function correctly in that case.
I was considering 3 approaches, all with the aim of processing a couple records at a time:
One-by-one record processing using fetchone()
def sum_scores(dbname, tablename):
...
sum = 0
for row_num in cur.rowcount:
row = cur.fetchone()
sum += score(row)
...
return sum
Batch-record processing using fetchmany(n)
def sum_scores(dbname, tablename):
...
batch_size = 1e3 # tunable
sum = 0
batch = cur.fetchmany(batch_size)
while batch:
for row in batch:
sum += score(row)
batch = cur.fetchmany(batch_size)
...
return sum
Relying on the cursor's iterator
def sum_scores(dbname, tablename):
...
sum = 0
for row in cur:
sum += score(row)
...
return sum
Was my thinking correct in that my 3 proposed solutions would only pull in manageable sized chunks of data at a time? Or do they suffer from the same problem as fetchall
?
Which of the 3 proposed solutions would work (ie. compute the correct score combination and not crash in the process) for LARGE datasets?
How does the cursor's iterator (Proposed Solution #3) actually pull in data into Python's memory? One-by-one, in batches, or all at once?
Introduction to Vaex. Vaex is a python library that is an out-of-core dataframe, which can handle up to 1 billion rows per second. 1 billion rows. Yes, you read it right, that too, in a second.
In relational databases, a row is a data record within a table. Each row, which represents a complete record of specific item data, holds different data within the same structure. A row is occasionally referred to as a tuple.
All 3 solutions will work, and only bring a subset of the results into memory.
Iterating via the cursor, Proposed solution #3, will work the same as Proposed Solution #2, if you pass a name to the cursor. Iterating over the cursor will fetch itersize records (default is 2000).
Solutions #2 and #3 will be much quicker than #1, because there is much less of a connection overhead.
http://initd.org/psycopg/docs/cursor.html#fetch
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With