
CSV Random Access; C#

I have a 10GB CSV file which is essentially a huge square matrix. I am trying to write a function that can access a single cell of the matrix as efficiently as possible, e.g. matrix[12345,20000].

Given its size, it is obviously not possible to load the entire matrix into a 2D array, so I need to read the values directly from the file.

I have Googled around and looked at random file access using FileStream.Seek. Unfortunately, because the values are rounded to varying numbers of digits, the cells are not a fixed width, so I cannot compute the byte offset of a given cell arithmetically and seek straight to it.

I considered scanning the file and creating a lookup table for the index of the first byte of each row. That way, if I wanted to access matrix[12345,20000] I would seek to the start of row 12345 and then scan across the line, counting the commas until I reach the correct cell.
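A minimal sketch of that lookup-table idea, in case it helps to see it written out. The names `CsvMatrix`, `BuildRowIndex` and `ReadCell` are made up for illustration, and the code assumes a single-byte encoding such as ASCII, `\n` or `\r\n` line endings, and no quoted fields containing commas or newlines.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper, not from any library.
static class CsvMatrix
{
    // Scan the file once and record the byte offset at which each row starts.
    public static List<long> BuildRowIndex(string path)
    {
        var offsets = new List<long> { 0 };
        var buffer = new byte[1 << 16];
        long position = 0; // bytes consumed by earlier reads
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            int read;
            while ((read = fs.Read(buffer, 0, buffer.Length)) > 0)
            {
                for (int i = 0; i < read; i++)
                {
                    // The byte after each newline is the start of the next row.
                    if (buffer[i] == (byte)'\n')
                        offsets.Add(position + i + 1);
                }
                position += read;
            }
        }
        // An offset equal to the file length is not a real row
        // (trailing newline, or an empty file), so drop it.
        if (offsets[offsets.Count - 1] == position)
            offsets.RemoveAt(offsets.Count - 1);
        return offsets;
    }

    // Seek to the start of the row, then count commas until the wanted column.
    // Returns an empty string if the row has fewer columns than requested.
    public static string ReadCell(string path, IList<long> rowOffsets, int row, int column)
    {
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            fs.Seek(rowOffsets[row], SeekOrigin.Begin);
            var cell = new StringBuilder();
            int commas = 0;
            int b;
            while ((b = fs.ReadByte()) != -1 && b != '\n')
            {
                if (b == ',')
                {
                    if (commas == column) break; // end of the wanted cell
                    commas++;
                }
                else if (commas == column && b != '\r')
                {
                    cell.Append((char)b);
                }
            }
            return cell.ToString();
        }
    }
}
```

Usage would look something like `var index = CsvMatrix.BuildRowIndex(path);` once, followed by `CsvMatrix.ReadCell(path, index, 12345, 20000)` per lookup. For many lookups it would probably be better to keep one `FileStream` open rather than reopening the file each time. The index costs 8 bytes per row, so it stays small even for a very large file.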

I am about to try this, but has anyone else got any better ideas? I'm sure I wouldn't be the first person to try and deal with a file like this.

Cheers

Edit: I should note that the file contains a very sparse matrix. If parsing the CSV file ends up being too slow, I would consider converting the file to a more appropriate, and easier to process, file format. What is the best way to store a sparse matrix?

user593062 asked Jan 27 '11 23:01


3 Answers

I have used the Lumenworks CSV reader on quite large CSV files; it may be worth a quick look to see how quickly it can parse your file.

Lumenworks CSV

PMC answered Nov 10 '22 16:11


First of all, how do you want to refer to a particular row? Is it by the row's index, so that you have another table or something similar telling you which row you are interested in? Or is it by an id of some kind?

These ideas come to mind:

  • Your approach
  • Binary search. Using the average row length (file size / number of rows) to guess where a row starts, you can binary-search for it, provided each row contains an ordered identifier that tells you whether you have landed before or after the row you want.
  • Loading it into a database! By the way, what prevents you from doing that? You can even use SQL Server Express - which is free - and to get around its database size limit, you can shard your data across multiple databases.

Aliostad answered Nov 10 '22 16:11


An index file is probably the best you can do. With rows of unknown size, there is no way to skip directly to a given line other than scanning the file or keeping an index.

The only question is how large your index is. If it is too large, you could make it smaller by indexing only every 5th line (for example) and then scanning forward over at most 4 lines from the nearest indexed one.
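As a rough sketch of the every-Nth-line variant (the names `SparseRowIndex`, `Build`, `FindRowStart` and the `stride` parameter are invented for illustration; it assumes `\n`-terminated rows and no newlines inside quoted fields):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not from any library.
static class SparseRowIndex
{
    // Record the byte offset of rows 0, stride, 2*stride, ... only.
    public static List<long> Build(string path, int stride)
    {
        var offsets = new List<long>();
        var buffer = new byte[1 << 16];
        long position = 0; // bytes consumed by earlier reads
        long row = 0;      // number of newlines seen so far
        using (var fs = File.OpenRead(path))
        {
            long length = fs.Length;
            if (length > 0) offsets.Add(0);
            int read;
            while ((read = fs.Read(buffer, 0, buffer.Length)) > 0)
            {
                for (int i = 0; i < read; i++)
                {
                    if (buffer[i] != (byte)'\n') continue;
                    row++;
                    long start = position + i + 1;
                    // Keep only every stride-th row, and ignore the
                    // position just past a trailing newline.
                    if (row % stride == 0 && start < length)
                        offsets.Add(start);
                }
                position += read;
            }
        }
        return offsets;
    }

    // Seek to the nearest indexed row at or before `row`, then skip
    // forward over at most stride - 1 lines. Returns the byte offset
    // of the row start and leaves the stream positioned there.
    public static long FindRowStart(FileStream fs, IList<long> offsets, int stride, long row)
    {
        fs.Seek(offsets[(int)(row / stride)], SeekOrigin.Begin);
        long toSkip = row % stride;
        int b;
        while (toSkip > 0 && (b = fs.ReadByte()) != -1)
        {
            if (b == '\n') toSkip--;
        }
        return fs.Position;
    }
}
```

The trade-off is memory against scan time: a stride of 5 cuts the index to a fifth of its size, at the cost of reading up to four extra rows per lookup. Since a full index needs only 8 bytes per row, the dense version may well be affordable here.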

František Žiačik answered Nov 10 '22 17:11