Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How Should I Implement a Huge but Simple Indexed StringList in Delphi?

I am using Delphi 2009. I have a very simple data structure, with 2 fields:

  1. A string that is the key field I need to retrieve by and is usually 4 to 15 characters in length.
  2. A string that is the data field that may be any size, from 1 character up to say 10,000 characters.

The difficulty is that I may have several million of these records, so they may total as much or more than 10 GB in size. Obviously, I'm looking for an on-disk solution rather than an in-memory solution.

My program needs to randomly retrieve these records, based on the key field. That's that part that needs to be made as efficient as possible.

Should I use a database for such a simple structure, and if so, which database would be best to handle this and be simplest to implement?

Alternatively, is there a simple on-disk data structure not requiring a full-blown database that would work just as well?


Well, all I needed was the one answer to kick me back into reality. I was looking for something simpler than even a simple database. But when the no-duh answer is to use a database, then I realize I've already answered this question with my own answer to another question: Best database for small applications and tools.

My answer was DISQLite3 for the reasons I specified there. And that's what I'll probably with for my implementation.


A few more good answers with some possibilities. That's great. I'll be able to try a few different methods to see what works best.


More contemplation, and I have had to change the accepted answer to the GpStructuredStorage solution.

In my case, a million records totalling several Gigabytes will put a strain on a database structure. Specifically, the B* tree that is used to store the index in most databases is fast, but will slow down for some operations such as reindexing a million values.

The only thing you'll find faster than B* for an index is a hash table. And that is precisely what is provided in gabr's suggested addition to the GpStructuredStorage solution. I think it's quite elegant the way he segmented the hash value to give a 4 level directory structure.

The key reason why I can go to a hash solution is that I only need random access by the keys. I do not need to sort by the keys. If sorting was needed, then the gains of the speed of the hash table would be lost and the database system would be a no-brain winner.

When I get down to implementing this, I should do a comparison of this technique versus a database. Maybe I'll compare with both Firebird and SQLite which would both be worthy opponents.


One more followup:

I just discovered Synopse Big Table by A. Bouchez which is designed for speed and meets the specs of my question almost precisely. I'll be trying it out first when I do my implementation in a few months and will report back here with my results.


A much later followup (July 2015)

I never did get to try Synopse Big Table. I've stuck with my B* tree up to now. But now I've upgraded to Delphi XE8 and plan to go with the database solution using FireDAC with SQLite.

like image 500
lkessler Avatar asked Dec 07 '22 05:12

lkessler


1 Answers

For more than 10GB in data, a database is exactly what you need. It will handle indexing for rapidly locating data (your random retrieval), the functionality for adding, modifying, and deleting data, and the actual storage, as well as much more if you so choose.

There are dozens of posts here related to which databases are available for use in Delphi, including built-ins and FOS ones like Firebird.

like image 69
Ken White Avatar answered Jan 05 '23 18:01

Ken White