Python

Question

My main goal is to calculate the median (by column) of a HUGE matrix of floats. Example:

import numpy

a = numpy.array([[1, 1, 3, 2, 7], [4, 5, 8, 2, 3], [1, 6, 9, 3, 2]])

numpy.median(a, axis=0)

Out[38]: array([ 1.,  5.,  8.,  2.,  3.])

The matrix is too big to fit in memory (~5 terabytes), so I keep it in a CSV file. I want to iterate over each column and calculate its median.

Is there any way for me to get a column iterator without reading the whole file?

Any other ideas about calculating the median for the matrix would be good too. Thank you!

Gareth Latty · Accepted Answer

If you can fit each column into memory (which you seem to imply you can), then this should work:

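import csv

def columns(file_name):
    # First pass: read a single row to learn how many columns there are.
    with open(file_name, newline="") as file:
        column_count = len(next(csv.reader(file)))
    # One further pass per column, keeping only that column's values as floats.
    for column in range(column_count):
        with open(file_name, newline="") as file:
            yield [float(row[column]) for row in csv.reader(file)]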

This works by first counting the columns, then making one pass over the file per column and taking that column's item out of each row. At most, it holds one column plus one row in memory at a time. It's a pretty simple generator. Note that the file has to be reopened for every column, because the reader is exhausted after each pass, so the whole file is read once per column.
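As a rough usage sketch, a generator like this can feed a per-column median directly. The helper name `column_medians` is hypothetical, and the sketch assumes the file has no header row and that every field parses as a float. It uses `statistics.median` from the standard library to stay self-contained; calling `numpy.median` on each yielded column should work the same way.

```python
import csv
import statistics


def columns(file_name):
    """Yield each column of a CSV file as a list of floats, one at a time."""
    with open(file_name, newline="") as handle:
        column_count = len(next(csv.reader(handle)))
    for index in range(column_count):
        # Reopen the file for every column: one full pass per column.
        with open(file_name, newline="") as handle:
            yield [float(row[index]) for row in csv.reader(handle)]


def column_medians(file_name):
    """Hypothetical helper: median of each column, one column in memory at a time."""
    return [statistics.median(column) for column in columns(file_name)]
```

Only one column is held in memory at any moment, but the cost is a full read of the file for each column, which matters a lot at this file size.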

Keith Randall · Answer

I would do this by initializing N empty files, one for each column. Then read the matrix one row at a time and send each column entry to the correct file. Once you've processed the whole matrix, go back and calculate the median of each file sequentially.

This basically uses the filesystem to do a matrix transpose. Once transposed, each original column sits in its own file, so calculating its median is easy.
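A minimal sketch of this file-based transpose might look like the following. The function names `split_columns` and `file_median` and the `col_<i>.txt` naming are illustrative assumptions, not part of the answer. It also assumes each single column fits in memory and that the number of columns stays under the operating system's open-file limit; a wider matrix would need to be processed in batches of columns.

```python
import csv
import itertools
import os
import statistics
from contextlib import ExitStack


def split_columns(csv_path, out_dir):
    """Write column i of csv_path to out_dir/col_i.txt, one value per line."""
    os.makedirs(out_dir, exist_ok=True)
    with open(csv_path, newline="") as source, ExitStack() as stack:
        reader = csv.reader(source)
        first_row = next(reader)
        paths = [os.path.join(out_dir, f"col_{i}.txt") for i in range(len(first_row))]
        # One open handle per column, all closed together by the ExitStack.
        outputs = [stack.enter_context(open(path, "w")) for path in paths]
        # Single pass over the matrix: route each entry to its column file.
        for row in itertools.chain([first_row], reader):
            for value, output in zip(row, outputs):
                output.write(value + "\n")
    return paths


def file_median(path):
    """Median of the floats stored one per line in a column file."""
    with open(path) as handle:
        return statistics.median(float(line) for line in handle)
```

Compared with reopening the CSV once per column, this reads the original matrix only once, at the price of writing a second copy of the data to disk.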

Python - get column iterator from a file (without reading the whole file)

Tags:

numpy

median

dbaron
