My main goal is to calculate the column-wise median of a HUGE matrix of floats. Example:
a = numpy.array(([1,1,3,2,7],[4,5,8,2,3],[1,6,9,3,2]))
numpy.median(a, axis=0)
Out[38]: array([ 1., 5., 8., 2., 3.])
The matrix is too big to fit in memory (~5 terabytes), so I keep it in a CSV file. I want to iterate over each column and calculate its median.
Is there a way to get a column iterator without reading the whole file into memory?
Any other ideas about calculating the median for the matrix would be good too. Thank you!
If you can fit each column into memory (which you seem to imply you can), then this should work:
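import csv

def columns(file_name):
    # Read only the first row to find out how many columns there are.
    with open(file_name, newline="") as file:
        column_count = len(next(csv.reader(file)))
    # Re-read the file once per column, keeping only that column's values.
    for column in range(column_count):
        with open(file_name, newline="") as file:
            # csv.reader yields strings, so convert each value to float.
            yield [float(row[column]) for row in csv.reader(file)]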
This works by finding out how many columns we have, then looping over the file once per column, taking the current column's item out of each row. At most, we hold one column plus one row in memory at a time. The cost is one full pass over the file for every column. It's a pretty simple generator. Note that we have to keep reopening the file, because the reader is exhausted once we have looped through it.
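Since the approach above rereads the file once per column, a possible variation is to pull several columns per pass and hand them to numpy.median together. The sketch below is only an illustration under assumptions that are not in the answer: the function name column_medians and its batch_size parameter are hypothetical, the file is assumed to have no header row, and every row is assumed to have the same number of numeric fields.

```python
import csv

import numpy as np


def column_medians(file_name, batch_size=1):
    """Return the median of every column, reading batch_size columns per pass.

    Hypothetical helper: assumes no header row and purely numeric fields.
    """
    # Peek at the first row to learn how many columns there are.
    with open(file_name, newline="") as f:
        n_columns = len(next(csv.reader(f)))

    medians = []
    for start in range(0, n_columns, batch_size):
        stop = min(start + batch_size, n_columns)
        with open(file_name, newline="") as f:
            # Only the columns in [start, stop) are kept in memory.
            batch = [[float(x) for x in row[start:stop]] for row in csv.reader(f)]
        # axis=0 gives one median per column in this batch.
        medians.extend(np.median(np.array(batch), axis=0).tolist())
    return medians
```

With batch_size=1 this behaves like the one-column-at-a-time generator. A larger batch_size trades memory for fewer passes over the file, so it is worth picking the largest value whose batch still fits in RAM.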
I would do this by initializing N empty files, one for each column. Then read the matrix one row at a time and send each column entry to the correct file. Once you've processed the whole matrix, go back and calculate the median of each file sequentially.
This basically uses the filesystem to do a matrix transpose. Once transposed, each file holds one column of the original matrix, so calculating its median is easy.
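A minimal sketch of this filesystem transpose might look like the following. The function name medians_via_column_files, the work_dir parameter, and the col_<i>.txt file names are all made up for illustration. It also assumes each individual column fits in memory for the final median step, and that the operating system allows one open file handle per column, which may not hold for very wide matrices.

```python
import csv
import os
from contextlib import ExitStack

import numpy as np


def medians_via_column_files(file_name, work_dir):
    """Split a CSV into one file per column, then take each file's median.

    Hypothetical helper: assumes no header row and purely numeric fields.
    """
    with ExitStack() as stack:
        src = stack.enter_context(open(file_name, newline=""))
        outs = None
        paths = []
        for row in csv.reader(src):
            if outs is None:
                # Create one output file per column on the first row.
                paths = [os.path.join(work_dir, f"col_{i}.txt")
                         for i in range(len(row))]
                outs = [stack.enter_context(open(p, "w")) for p in paths]
            # Send each entry of the row to its column's file.
            for out, value in zip(outs, row):
                out.write(value + "\n")
    # All files are closed here; process them one at a time.
    return [float(np.median(np.loadtxt(p, ndmin=1))) for p in paths]
```

This reads the source matrix only once, at the price of writing a second copy of the data to disk. If the per-column handle limit is a problem, the same idea can be applied to groups of columns in several passes.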