
Pythonic way to populate numpy array

I find myself parsing lots of data files (usually in a .csv file or similar) using the csv reader and a for loop to iterate over every line. The data is usually a table of floats, for example:

reader = csv.reader(open('somefile.csv'))
header = next(reader)  # reader.next() in Python 2

res_list = [list() for i in header]    

for line in reader:
  for i in range(len(line)):
    res_list[i].append(float(line[i]))

result_dict = dict(zip(header,res_list)) #so we can refer by column title

This is an OK way to populate, and I get each column as a separate list. However, I would prefer that the default data container for lists of items (and nested lists) be NumPy arrays, since 99 times out of 100 the numbers get pumped into various processing scripts/functions, and having the power of NumPy arrays makes my life easier.

The numpy append(arr, item) doesn't append in-place and therefore requires re-creating the array for every point added to the table (which is slow and unnecessary). I could also iterate over the list of data columns and wrap them into an array after I'm done (which is what I've been doing), but sometimes it isn't so clear-cut when I'm done parsing the file, and I may need to append stuff to the list later down the line anyway.
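To illustrate the copying behavior described above, a small sketch: numpy.append returns a freshly allocated array each time rather than growing the original in place.

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.append(a, 3.0)  # returns a new array; `a` itself is unchanged

print(a)  # [1. 2.]
print(b)  # [1. 2. 3.]
```

Calling this once per row therefore copies the whole array on every append, which is why accumulating in a Python list first is the faster pattern.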

I was wondering if there is some less boilerplate-heavy way (to use the overused phrase, a more "pythonic" way) to process tables of data like this, or to populate arrays (where the underlying container is a list) dynamically and without copying arrays all the time.

(On another note: it's kind of annoying that, in general, people use columns to organize data but csv reads in rows. If the reader incorporated a read_column argument (yes, I know it wouldn't be super efficient), I think many people would avoid boilerplate code like the above to parse a csv data file.)

asked Sep 09 '11 by crasic

2 Answers

There is numpy.loadtxt:

X = numpy.loadtxt('somefile.csv', delimiter=',', skiprows=1)  # skip the header row

Documentation.


Edit: for a list of numpy arrays,

X = [numpy.array(line.split(','), dtype=float)  # scipy.array was just an alias for numpy.array
     for line in open('somefile.csv', 'r')]
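As a complement to loadtxt: if access by column title is the goal (like the question's result_dict), numpy.genfromtxt can consume the header row itself and return a structured array. A minimal sketch, using a hypothetical in-memory stand-in for somefile.csv:

```python
import io
import numpy as np

# Hypothetical data standing in for somefile.csv from the question.
csv_text = "a,b,c\n1.0,2.0,3.0\n4.0,5.0,6.0\n"

# names=True takes field names from the first row, so each column
# is addressable by its header title, much like result_dict.
data = np.genfromtxt(io.StringIO(csv_text), delimiter=',', names=True)

print(data['a'])  # the column titled 'a'
```

Each column comes back as a NumPy array, so no list-building loop is needed at all.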
answered Nov 14 '22 by Steve Tjoa

I think it is difficult to improve very much on what you have. Python lists are relatively cheap to build and append; NumPy arrays are more expensive to create and don't offer a .append() method at all. So your best bet is to build the lists like you already are doing, and then coerce to np.array() when the time comes.

A few small points:

  • It is slightly faster to use [] to create a list than to call list(). This is such a tiny amount of the runtime of the program that you can feel free to ignore this point.

  • When you don't actually use the loop index, you can use _ for the variable name to document this.

  • It's usually better to iterate over a sequence than to find the length of the sequence, build a range(), and then index the sequence a lot. You can use enumerate() to get an index if you also need the index.

Put those together and I think this is a slightly improved version. But it is almost unchanged from your original, and I can't think of any really good improvements.

reader = csv.reader(open('somefile.csv'))
header = next(reader)  # reader.next() in Python 2

res_list = [[] for _ in header]

for row in reader:
    for i, val in enumerate(row):
        res_list[i].append(float(val))

# build dict so we can refer by column title
result_dict = dict((n, res_list[i]) for i, n in enumerate(header))
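To finish the conversion this answer describes (coerce to np.array() when the time comes), a minimal sketch, using hypothetical header and res_list values standing in for the loop's output:

```python
import numpy as np

# Hypothetical parsed data standing in for the loop's output.
header = ['x', 'y']
res_list = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

# Coerce each accumulated Python list to an ndarray in one pass;
# the lists stay cheap to append to, and the arrays are built once.
result_dict = {name: np.array(col) for name, col in zip(header, res_list)}

print(result_dict['y'].mean())
```

If more rows show up later, append to the plain lists and rebuild the arrays once at the end, rather than re-creating an array per row.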
answered Nov 14 '22 by steveha