I am working on a project that involves doing calculations on large matrices of data. I have 10 CSV files, each with 10,000 rows and 100 columns. Currently, I'm running a background job that reads the data from each CSV, pulls it into an array, runs some matrix multiplication calculations on the data, and then moves on to the next CSV. I'm sure there is a better way to do this, because most of the job's processing time seems to be spent opening and parsing the CSVs. My question really boils down to this: how should I store the data that is currently in those CSV files so that I can access it easily and run the calculations more efficiently? Any help would be appreciated.
EDIT
As suggested in the comments, I'd like to add that the matrix density is 100% and the numbers are all floats.
CSV is a very, very inefficient format for any kind of large data. Given that all of your data is numeric and that your data sizes are consistent, a compact binary format would be best. If you store each file as a binary file of 1,000,000 8-byte double-precision floats in network (big-endian) byte order, where the first hundred values are the first row, the second hundred are the second row, and so on, it would cut your file size to ~8MB from roughly 12MB and completely remove the cost of parsing CSV text (which is considerable). To convert your data to this format, try running this Ruby code (I assume that data is a 2d array of your CSV):
newdat = data.flatten.map {|e| e.to_f}.pack("G*")
Then write newdat to a file as your new data:
File.open("data.dat", "wb") { |f| f.write(newdat) }
To parse this data from a file:
data = File.binread("data.dat").unpack("G*").each_slice(100).to_a
This will set data to your matrix as a 2d array.
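If you have ten files to convert, the write and read steps above could be wrapped in two small helpers. This is only a minimal sketch: the names save_matrix and load_matrix are made up for illustration, and it assumes the caller knows the column count (100 in your case, 3 in the toy example here).

```ruby
require "tmpdir"

# Hypothetical helper (name is illustrative, not from the answer above).
# Writes a 2D array of numbers as big-endian 8-byte doubles, row by row.
def save_matrix(path, rows)
  File.binwrite(path, rows.flatten.map(&:to_f).pack("G*"))
end

# Hypothetical helper: reads the file back and regroups the flat list
# of values into rows of `cols` values each.
def load_matrix(path, cols)
  File.binread(path).unpack("G*").each_slice(cols).to_a
end

Dir.mktmpdir do |dir|
  path = File.join(dir, "data.dat")
  matrix = [[1.5, 2.25, -3.0], [4.0, 5.125, 6.75]]
  save_matrix(path, matrix)

  # Each value occupies 8 bytes, so a 2x3 matrix takes 2 * 3 * 8 = 48 bytes.
  puts File.size(path)
  # Doubles survive the round trip bit for bit, so plain equality works.
  puts load_matrix(path, 3) == matrix
end
```

Because the file carries no header, the column count has to come from somewhere else. If your shapes ever vary, you would need to store the dimensions alongside the data.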
Note: I can't actually give you hard numbers for the efficiency of this, as I don't have any giant CSV files full of floats lying around. However, this should be much more efficient.
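If you want numbers for your own machine, something like the following could be used to time both approaches. It is a rough sketch under several assumptions: the data is random and synthetic, the row count is reduced to keep the run short, the CSV side uses a naive split on commas rather than a full CSV parser, and the helper name timed is made up for this example.

```ruby
require "tmpdir"

ROWS = 1_000 # fewer than the 10,000 rows in the question, to keep the run short
COLS = 100

# Hypothetical helper: runs the block and returns [result, elapsed seconds],
# measured with the monotonic clock.
def timed
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result = yield
  [result, Process.clock_gettime(Process::CLOCK_MONOTONIC) - start]
end

# Synthetic data standing in for one of the CSV files.
data = Array.new(ROWS) { Array.new(COLS) { rand } }

Dir.mktmpdir do |dir|
  csv_path = File.join(dir, "data.csv")
  bin_path = File.join(dir, "data.dat")
  File.write(csv_path, data.map { |row| row.join(",") }.join("\n"))
  File.binwrite(bin_path, data.flatten.pack("G*"))

  # Naive CSV read: split on commas, which is enough for plain numeric fields.
  from_csv, csv_secs = timed do
    File.foreach(csv_path).map { |line| line.split(",").map(&:to_f) }
  end

  # Binary read: one unpack call, then regroup into rows.
  from_bin, bin_secs = timed do
    File.binread(bin_path).unpack("G*").each_slice(COLS).to_a
  end

  puts format("CSV: %.4fs  binary: %.4fs", csv_secs, bin_secs)
  # Sanity check that both paths reproduce the original data.
  puts from_csv == data && from_bin == data
end
```

A single run like this is noisy, so repeat it a few times and use your real files before drawing conclusions.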
Have you considered using Marshal to save the array in binary? I haven't used it, but it seems dead-simple:
FNAME = 'matrix4.mtx'
a = [2.3, 1.4, 6.7]
File.open(FNAME, 'wb') {|f| f.write(Marshal.dump(a))}
b = Marshal.load(File.binread(FNAME)) # => [2.3,1.4,6.7]
Of course, you'd have to read the entire array into memory, but the arrays don't seem that big by current standards.
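The same idea works for a nested array, so a whole matrix could be dumped in one go. A minimal sketch, with made-up helper names dump_matrix and load_marshaled and a tiny 2x2 matrix standing in for the real data:

```ruby
require "tmpdir"

# Hypothetical helper (name is illustrative): serializes a 2D array with Marshal.
def dump_matrix(path, rows)
  File.binwrite(path, Marshal.dump(rows))
end

# Hypothetical helper: restores the 2D array from the file.
# Only call Marshal.load on files you wrote yourself; loading untrusted
# data with Marshal is unsafe.
def load_marshaled(path)
  Marshal.load(File.binread(path))
end

Dir.mktmpdir do |dir|
  path = File.join(dir, "matrix.mtx")
  matrix = [[2.3, 1.4], [6.7, 0.5]]
  dump_matrix(path, matrix)
  # The nested structure comes back as-is, so no each_slice step is needed.
  puts load_marshaled(path) == matrix
end
```

Unlike the raw packed format, Marshal keeps the row structure for you, at the cost of a Ruby-specific file format.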