
Python - get column iterator from a file (without reading the whole file)

My main goal is to calculate the column-wise median of a HUGE matrix of floats. Example:

import numpy

a = numpy.array([[1, 1, 3, 2, 7], [4, 5, 8, 2, 3], [1, 6, 9, 3, 2]])

numpy.median(a, axis=0)

Out[38]: array([ 1.,  5.,  8.,  2.,  3.])

The matrix is too big to fit in memory (~5 terabytes), so I keep it in a CSV file. I want to iterate over each column and calculate its median.

Is there any way for me to get a column iterator without reading the whole file into memory?

Any other ideas about calculating the median for the matrix would be good too. Thank you!

Asked by dbaron, Sep 22 '12 21:09


2 Answers

If you can fit each column into memory (which you seem to imply you can), then this should work:

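import csv

def columns(file_name):
    # First pass: read a single row to find out how many columns there are.
    with open(file_name, newline="") as file:
        column_count = len(next(csv.reader(file)))
    # Then make one full pass over the file per column.
    for column in range(column_count):
        with open(file_name, newline="") as file:
            # csv.reader yields strings, so convert each entry to float.
            yield [float(row[column]) for row in csv.reader(file)]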

This works by finding out how many columns we have, then looping over the file once per column, taking that column's item out of each row. At most, this holds one column plus one row in memory at a time. It's a pretty simple generator. Note that we have to reopen the file for each column, because the reader is exhausted once we have looped through it, so the whole file is read once per column.
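As a rough sketch of how this per-column pass could feed the median calculation from the question: the helper name column_medians is hypothetical, statistics.median from the standard library stands in for numpy.median, and the file is assumed to have no header row and only numeric fields.

```python
import csv
import statistics

def column_medians(file_name):
    """Yield the median of each CSV column, re-reading the file once per column.

    Hypothetical helper: assumes no header row and that every field parses as a float.
    """
    with open(file_name, newline="") as f:
        n_columns = len(next(csv.reader(f)))
    for index in range(n_columns):
        with open(file_name, newline="") as f:
            # Only one column's worth of floats is held in memory at a time.
            values = [float(row[index]) for row in csv.reader(f)]
        yield statistics.median(values)
```

For the 3x5 example matrix in the question, list(column_medians(path)) should give the same five medians as numpy.median(a, axis=0). The trade-off is time: with N columns the file is scanned N + 1 times.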

Answered by Gareth Latty, Sep 24 '22 03:09


I would do this by initializing N empty files, one for each column. Then read the matrix one row at a time and send each column entry to the correct file. Once you've processed the whole matrix, go back and calculate the median of each file sequentially.

This basically uses the filesystem to do a matrix transpose. Once transposed, each file holds one row of the transposed matrix (that is, one column of the original), so calculating its median is easy.
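A minimal sketch of this transpose-through-files idea, under a few assumptions that are not in the answer: the function names split_columns and median_of_file and the col_i.txt naming are made up for illustration, the CSV has no header row, and the number of columns is small enough to keep one open file handle per column (otherwise the columns would have to be processed in batches).

```python
import csv
import os
import statistics
from contextlib import ExitStack

def split_columns(csv_path, out_dir):
    """Write column i of csv_path to out_dir/col_i.txt, one value per line.

    Returns the number of columns written. Reads the matrix once, row by row.
    """
    os.makedirs(out_dir, exist_ok=True)
    outputs = []
    with open(csv_path, newline="") as src, ExitStack() as stack:
        for row in csv.reader(src):
            if not outputs:
                # Open one output file per column when the first row is seen.
                outputs = [
                    stack.enter_context(
                        open(os.path.join(out_dir, f"col_{i}.txt"), "w")
                    )
                    for i in range(len(row))
                ]
            for out, value in zip(outputs, row):
                out.write(value + "\n")
    return len(outputs)

def median_of_file(path):
    """Median of a one-value-per-line file; loads that single column into memory."""
    with open(path) as f:
        return statistics.median(float(line) for line in f)
```

Compared with re-reading the CSV once per column, this reads the matrix a single time but needs roughly the same amount of disk space again for the column files.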

Answered by Keith Randall, Sep 24 '22 03:09