
C# - Large collection storage

I'm currently facing a head-scratching problem: I am working with a large data set (when I say large, I mean billions of rows of data), and I am caught between speed and scalability.

I can store the billions of rows of data in the database, but my application needs to constantly check whether a new row of data already exists in the dataset; if not, insert it, otherwise retrieve it.

If I were to use a database solution, I estimate each call to the database to retrieve a row of data would take 10ms (an optimistic estimate). I need to retrieve about 800k records for each file that I process in my application, which means (10ms x 800k = 2.22 hours) per file. That timespan is too long to analyse and process one file, especially considering that the time required to retrieve a row will only increase as the database grows to billions and billions of rows.

I have also thought of keeping a List or HashSet in local memory to compare against and retrieve from, but that is not going to work out, as I will not be able to fit billions of records (objects) in memory.

Please advise on what I should do for my situation.

Edit: Oh, I forgot to state that I have already implemented a semi-cache: once a record is retrieved, it is cached in memory, so if the same record needs to be retrieved again, it comes from memory instead. But I face the same problem there, as I will reach a point where the memory can no longer fit any more cached data.
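(The usual way out of that last problem is to bound the cache: evict the least recently used entries once a size limit is reached, so hot records stay in memory while cold ones fall out. Below is a minimal sketch of such a bounded LRU cache in C#; the class name and type parameters are illustrative, not part of the original post.)

    using System.Collections.Generic;

    // Minimal LRU cache sketch: holds at most `capacity` entries and
    // evicts the least recently used entry when a new one is added.
    public class BoundedLruCache<TKey, TValue> where TKey : notnull
    {
        private readonly int _capacity;
        private readonly Dictionary<TKey, LinkedListNode<KeyValuePair<TKey, TValue>>> _map = new();
        private readonly LinkedList<KeyValuePair<TKey, TValue>> _order = new();

        public BoundedLruCache(int capacity) => _capacity = capacity;

        public bool TryGet(TKey key, out TValue value)
        {
            if (_map.TryGetValue(key, out var node))
            {
                // Move the hit to the front so it is evicted last.
                _order.Remove(node);
                _order.AddFirst(node);
                value = node.Value.Value;
                return true;
            }
            value = default!;
            return false;
        }

        public void Put(TKey key, TValue value)
        {
            if (_map.TryGetValue(key, out var existing))
            {
                // Replacing an existing entry: drop the old node first.
                _order.Remove(existing);
                _map.Remove(key);
            }
            else if (_map.Count >= _capacity)
            {
                // At capacity: evict the least recently used entry.
                var lru = _order.Last!;
                _order.RemoveLast();
                _map.Remove(lru.Value.Key);
            }
            var node = new LinkedListNode<KeyValuePair<TKey, TValue>>(new(key, value));
            _order.AddFirst(node);
            _map.Add(key, node);
        }
    }

With a capacity of a few hundred thousand entries, the cache keeps a fixed memory footprint no matter how many records flow through it.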

asked Dec 10 '14 by Dan


1 Answer

Ideally, if you are working with a large amount of data, you have to make sure that you do not run out of resources while processing it; you need to find a reasonable way to increase the utilization of the resources you have.

I would definitely go with a database, because that is the best-known way to query and store data optimally. You didn't mention what exactly your application does, so I can only give general opinions on how I would handle such a scenario:

  1. If your database really is as big as you say (billions of rows) and the data is being read for analytic or reporting purposes, look into a data-mining structure such as OLAP cubes. This helps you organise your data in a way that reduces query time.
  2. If the above is not an option, find a way to partition your data horizontally or vertically; how exactly depends on how you actually retrieve the data and how you can group related rows together.
  3. Find a way to query a group of rows at once (e.g. WHERE pk IN (1, 2, 3, 4, ..., 100)) instead of querying one row at a time as you described; batching the lookups dramatically cuts the number of round trips (see the sketch after this list).
  4. It is best to find a primary key within the data itself, so that rows are physically sorted in primary-key order and you know a row's key before you even insert it. If you are not querying by primary key, place reasonable indexes on the columns you do query to improve response time.
  5. Keep the database connection open for the life of your application and reconnect only if it drops; use a connection pool if multiple connections to the database are expected.
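To make point 3 concrete, here is a minimal sketch of a batched lookup in C# using plain ADO.NET (the Records table, its Id/Payload columns, and the connection string are illustrative assumptions, not from the original answer). Note that ADO.NET pools connections by default, so opening and closing a SqlConnection per batch is cheap and covers point 5 as well:

    using System.Collections.Generic;
    using System.Data.SqlClient;
    using System.Linq;

    static class BatchedLookup
    {
        // Illustrative connection string; Pooling=true is the ADO.NET default.
        const string ConnStr =
            "Server=.;Database=MyDb;Integrated Security=true;Pooling=true";

        // Fetch a whole batch of keys in one round trip instead of one query per key.
        public static Dictionary<long, string> FetchBatch(IReadOnlyList<long> keys)
        {
            var result = new Dictionary<long, string>();

            // Build "@p0, @p1, ..." so the query stays parameterized.
            var placeholders = string.Join(", ", keys.Select((_, i) => "@p" + i));
            var sql = $"SELECT Id, Payload FROM Records WHERE Id IN ({placeholders})";

            using (var conn = new SqlConnection(ConnStr)) // Open() draws from the pool
            using (var cmd = new SqlCommand(sql, conn))
            {
                for (int i = 0; i < keys.Count; i++)
                    cmd.Parameters.AddWithValue("@p" + i, keys[i]);

                conn.Open();
                using (var reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                        result[reader.GetInt64(0)] = reader.GetString(1);
                }
            }
            // Keys missing from the result are the rows that still need inserting.
            return result;
        }
    }

Keep each batch comfortably under SQL Server's 2,100-parameter limit (say, 500 to 1,000 keys per query); at 800k lookups per file, that turns roughly 800,000 round trips into about a thousand.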
answered Oct 31 '22 by Mubashar