Processing each row of a large database table in Python

Context

I have a function in python that scores a row in my table. I would like to combine the scores of all the rows arithmetically (eg. computing the sum, average, etc.. of the scores).

def compute_score(row):
  # some complicated python code that would be painful to convert into SQL-equivalent
  return score

The obvious first approach is to simply read in all the data

import psycopg2

def sum_scores(dbname, tablename):
  conn = psycopg2.connect(dbname)
  cur = conn.cursor()
  cur.execute('SELECT * FROM ?', tablename)
  rows = cur.fetchall()
  sum = 0
  for row in rows:
    sum += score(row)
  conn.close()
  return sum

Problem

I would like to be able to handle as much data as my database can hold. This could be larger that what would fit into Python's memory, so fetchall() seems to me like it would not function correctly in that case.

Proposed Solutions

I was considering 3 approaches, all with the aim of processing a couple records at a time:

One-by-one record processing using fetchone()

def sum_scores(dbname, tablename):
  ...
  sum = 0
  for row_num in cur.rowcount:
    row = cur.fetchone()
    sum += score(row)
  ...
  return sum

Batch-record processing using fetchmany(n)

def sum_scores(dbname, tablename):
  ...
  batch_size = 1e3 # tunable
  sum = 0
  batch = cur.fetchmany(batch_size)  
  while batch:
    for row in batch:
      sum += score(row)
    batch = cur.fetchmany(batch_size)
  ...
  return sum

Relying on the cursor's iterator

def sum_scores(dbname, tablename):
  ...
  sum = 0
  for row in cur:
    sum += score(row)
  ...
  return sum

Questions

Was my thinking correct in that my 3 proposed solutions would only pull in manageable sized chunks of data at a time? Or do they suffer from the same problem as fetchall?
Which of the 3 proposed solutions would work (ie. compute the correct score combination and not crash in the process) for LARGE datasets?
How does the cursor's iterator (Proposed Solution #3) actually pull in data into Python's memory? One-by-one, in batches, or all at once?

Can python handle 1 billion rows?

Introduction to Vaex. Vaex is a python library that is an out-of-core dataframe, which can handle up to 1 billion rows per second. 1 billion rows. Yes, you read it right, that too, in a second.

What is row in data processing?

In relational databases, a row is a data record within a table. Each row, which represents a complete record of specific item data, holds different data within the same structure. A row is occasionally referred to as a tuple.

All 3 solutions will work, and only bring a subset of the results into memory.

Iterating via the cursor, Proposed solution #3, will work the same as Proposed Solution #2, if you pass a name to the cursor. Iterating over the cursor will fetch itersize records (default is 2000).

Solutions #2 and #3 will be much quicker than #1, because there is much less of a connection overhead.

http://initd.org/psycopg/docs/cursor.html#fetch

Processing each row of a large database table in Python

Tags:

python

psycopg2

bigdata

Context

Problem

Proposed Solutions

Questions

Pedro Cattori

People also ask

1 Answers

jastr

Recent Activity

Donate For Us

Processing each row of a large database table in Python

Tags:

python

psycopg2

bigdata

Context

Problem

Proposed Solutions

Questions

Pedro Cattori

People also ask

1 Answers

jastr

Related questions

Recent Activity

Donate For Us