I am implementing a type of search (TF-IDF) in which every word has a score calculated that is proportional to all the documents being searched. I have 100GB of documents to search.
If I was working with 1GB of documents, I would use:
Dictionary<string, List<Document>>
..where string
is the word and List<Document>
is all documents, ranked in order, containing that word. This does not scale up. I am using a Dictionary<>
because lookup time is O(1) (in theory).
My intended solution is a SQLServer database in which words are listed in a table, with the relevant List object stored serialized. My concern is that reading the DB and rebuilding to List<>
each time will be very inefficient.
Am I going in the wrong direction here? What is a normal solution to working with huge dictionaries?
You are right in saying that using a List
would be inefficient, on average a List
would achieve linear output (O(n)
).
Personally, I would use a Concurrent Dictionary
which is guaranteed to be O(1)
. During one of the project's I've worked on I was working with large files 100MB text files and I found a Concurrent Dictionary
could sufficiently sort and search information completing an estimated 10,000ish give or take records every second.
Have a look at this neat cheat sheet. for Big-Oh algorithms it provides some neat details for best and worse case scenarios. When dealing with massive data sets its important to keep the concepts of Abstraction
and Decomposition
in mind.
Abstraction Concentrate on the most important elements - ignore irrelevant details
Only store information that is important, I highly doubt you will need an entire 1GB file to be in memory.
Decomposition Divide and Conquer
Ensure the desktop running your application has a good latency to your database. I would recommend only storing what you need in memory and using LINQ to retrieve only the exact information you need, once you have the information that is relevant to your task... you can then filter it further.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With