Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Working with very large Dictionary<> in C#

I am implementing a type of search (TF-IDF) in which every word has a score calculated that is proportional to all the documents being searched. I have 100GB of documents to search.

If I was working with 1GB of documents, I would use:

Dictionary<string, List<Document>>

..where string is the word and List<Document> is all documents, ranked in order, containing that word. This does not scale up. I am using a Dictionary<> because lookup time is O(1) (in theory).

My intended solution is a SQLServer database in which words are listed in a table, with the relevant List object stored serialized. My concern is that reading the DB and rebuilding to List<> each time will be very inefficient.

Am I going in the wrong direction here? What is a normal solution to working with huge dictionaries?

like image 204
user2450099 Avatar asked Feb 15 '23 09:02

user2450099


1 Answers

You are right in saying that using a List would be inefficient, on average a List would achieve linear output (O(n)).

Personally, I would use a Concurrent Dictionary which is guaranteed to be O(1). During one of the project's I've worked on I was working with large files 100MB text files and I found a Concurrent Dictionary could sufficiently sort and search information completing an estimated 10,000ish give or take records every second.

Have a look at this neat cheat sheet. for Big-Oh algorithms it provides some neat details for best and worse case scenarios. When dealing with massive data sets its important to keep the concepts of Abstraction and Decomposition in mind.

Abstraction Concentrate on the most important elements - ignore irrelevant details

Only store information that is important, I highly doubt you will need an entire 1GB file to be in memory.

Decomposition Divide and Conquer

Ensure the desktop running your application has a good latency to your database. I would recommend only storing what you need in memory and using LINQ to retrieve only the exact information you need, once you have the information that is relevant to your task... you can then filter it further.

like image 104
classicjonesynz Avatar answered Feb 18 '23 16:02

classicjonesynz