 

Most efficient way to process a large csv in .NET

Forgive my noobiness, but I just need some guidance and I can't find another question that answers this. I have a fairly large csv file (~300k rows) and I need to determine, for a given input, whether any line in the csv begins with that input. I have sorted the csv alphabetically, but I don't know:

1) how to process the rows in the csv- should I read it in as a list/collection, or use OLEDB, or an embedded database or something else?

2) how to find something efficiently in an alphabetical list (using the fact that it's sorted to speed things up, rather than searching the whole list)

asked Jan 15 '13 by user1981003

1 Answer

If you can cache the data in memory, and you only need to search the list on one primary key column, I would recommend storing the data in memory as a Dictionary object. The Dictionary class stores the data as key/value pairs in a hash table. You could use the primary key column as the key in the dictionary, and then use the rest of the columns as the value in the dictionary. Looking up items by key in a hash table is typically very fast.

For instance, you could load the data into a dictionary, like this:

// TextFieldParser lives in the Microsoft.VisualBasic.FileIO namespace;
// add a reference to Microsoft.VisualBasic and a using directive for it.
Dictionary<string, string[]> data = new Dictionary<string, string[]>();
using (TextFieldParser parser = new TextFieldParser(@"C:\test.csv"))
{
    parser.TextFieldType = FieldType.Delimited;
    parser.SetDelimiters(",");
    while (!parser.EndOfData)
    {
        try
        {
            string[] fields = parser.ReadFields();
            data[fields[0]] = fields;
        }
        catch (MalformedLineException ex)
        {
            // skip or log the malformed line
        }
    }
}

And then you could get the data for any item, like this:

string[] fields = data["key I'm looking for"];

Note that indexing throws a KeyNotFoundException if the key is absent; data.TryGetValue is the safe alternative when the key might not exist.
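One caveat: a hash table only answers exact-key lookups, while the question asks whether any line *begins with* a given input. Since the file is already sorted, a binary search over the keys answers that prefix question directly. Below is a minimal sketch (the class and method names are my own, not from the answer above) using the fact that `List<T>.BinarySearch` returns the bitwise complement of the insertion point when the key is absent, which is exactly where a key with that prefix would sort:

```csharp
using System;
using System.Collections.Generic;

class PrefixSearch
{
    // Returns true if any key in the ordinally sorted list starts with 'prefix'.
    // The list must be sorted with the same comparer passed to BinarySearch.
    public static bool AnyKeyStartsWith(List<string> sortedKeys, string prefix)
    {
        int index = sortedKeys.BinarySearch(prefix, StringComparer.Ordinal);
        if (index >= 0)
            return true; // an exact match is also a prefix match

        // Not found: ~index is the insertion point, i.e. the first key
        // that sorts after 'prefix'. If any key starts with 'prefix',
        // it is that one.
        int insertionPoint = ~index;
        return insertionPoint < sortedKeys.Count
            && sortedKeys[insertionPoint].StartsWith(prefix, StringComparison.Ordinal);
    }

    static void Main()
    {
        var keys = new List<string> { "cherry", "apple", "banana" };
        keys.Sort(StringComparer.Ordinal);

        Console.WriteLine(AnyKeyStartsWith(keys, "ban"));   // True
        Console.WriteLine(AnyKeyStartsWith(keys, "grape")); // False
    }
}
```

This runs in O(log n) per lookup on the 300k rows, with no hashing involved; the trade-off versus the Dictionary is that it only finds the match position, so you would keep the full rows in a parallel sorted list (or sort the `string[]` rows by their first field) if you need the rest of the columns.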
answered Oct 13 '22 by Steven Doggart