Processing each row of a large database table in Python

Context

I have a function in Python that scores a row in my table. I would like to combine the scores of all the rows arithmetically (e.g. computing the sum, average, etc. of the scores).

def compute_score(row):
  # some complicated Python code that would be painful
  # to convert into an SQL equivalent
  return score  # placeholder for the computed score
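
For illustration only, a toy stand-in for the scoring logic (my real function is far more involved; the columns used here are purely hypothetical):

def compute_score(row):
  # hypothetical example: weight one column by another
  return row[1] * row[2]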

The obvious first approach is to simply read in all the data:

import psycopg2
from psycopg2 import sql

def sum_scores(dbname, tablename):
  conn = psycopg2.connect(dbname=dbname)
  cur = conn.cursor()
  # table names cannot be passed as query parameters,
  # so compose the identifier safely with psycopg2.sql
  cur.execute(sql.SQL('SELECT * FROM {}').format(sql.Identifier(tablename)))
  rows = cur.fetchall()
  total = 0
  for row in rows:
    total += compute_score(row)
  conn.close()
  return total

Problem

I would like to be able to handle as much data as my database can hold. This could be larger than what would fit into Python's memory, so fetchall() seems like it would not work in that case.

Proposed Solutions

I was considering 3 approaches, all with the aim of processing only a few records at a time:

  1. One-by-one record processing using fetchone()

    def sum_scores(dbname, tablename):
      ...
      total = 0
      row = cur.fetchone()
      while row is not None:
        total += compute_score(row)
        row = cur.fetchone()
      ...
      return total
    
  2. Batch-record processing using fetchmany(n)

    def sum_scores(dbname, tablename):
      ...
      batch_size = 1000  # tunable; fetchmany() requires an int
      total = 0
      batch = cur.fetchmany(batch_size)
      while batch:
        for row in batch:
          total += compute_score(row)
        batch = cur.fetchmany(batch_size)
      ...
      return total
    
  3. Relying on the cursor's iterator

    def sum_scores(dbname, tablename):
      ...
      total = 0
      for row in cur:
        total += compute_score(row)
      ...
      return total
    

Questions

  1. Was my thinking correct in that my 3 proposed solutions would only pull in manageable-sized chunks of data at a time? Or do they suffer from the same problem as fetchall()?

  2. Which of the 3 proposed solutions would work (i.e. compute the correct score combination and not crash in the process) for LARGE datasets?

  3. How does the cursor's iterator (Proposed Solution #3) actually pull data into Python's memory? One-by-one, in batches, or all at once?

asked Oct 17 '15 by Pedro Cattori


1 Answer

All 3 solutions will work, and will only bring a subset of the results into memory, provided you use a server-side cursor. With psycopg2's default client-side cursor, execute() transfers the entire result set to the client up front, and fetchone(), fetchmany(), and iteration merely walk through that local buffer. To get true streaming, create a server-side cursor by passing a name to conn.cursor().

Iterating via the cursor (Proposed Solution #3) then works much like Proposed Solution #2: iterating over a named cursor fetches itersize records per round trip (default is 2000).

Solutions #2 and #3 will be much quicker than #1, because fetchone() on a named cursor makes a round trip to the server for every single row.
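
For example, here is a minimal sketch of Proposed Solution #3 with a server-side cursor (the cursor name 'score_cursor' is arbitrary, and compute_score is the scoring function from the question):

import psycopg2
from psycopg2 import sql

def sum_scores(dbname, tablename):
  conn = psycopg2.connect(dbname=dbname)
  # passing a name creates a server-side cursor: the result set stays
  # in PostgreSQL and is streamed to the client as it is consumed
  cur = conn.cursor(name='score_cursor')
  cur.itersize = 2000  # rows transferred per round trip while iterating
  cur.execute(sql.SQL('SELECT * FROM {}').format(sql.Identifier(tablename)))
  total = 0
  for row in cur:
    total += compute_score(row)
  conn.close()
  return total

Memory use is then bounded by itersize rows at a time rather than by the size of the table.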

http://initd.org/psycopg/docs/cursor.html#fetch

answered Oct 24 '22 by jastr