Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

TStringList, Dynamic Array or Linked List in Delphi?

I have a choice.

I have a number of already ordered strings that I need to store and access. It looks like I can choose between using:

  1. A TStringList

  2. A Dynamic Array of strings, and

  3. A Linked List of strings (singly linked)

    and Alan in his comment suggested I also add to the choices:

  4. TList<string>

In what circumstances is each of these better than the others?

Which is best for small lists (under 10 items)?

Which is best for large lists (over 1000 items)?

Which is best for huge lists (over 1,000,000 items)?

Which is best to minimize memory use?

Which is best to minimize loading time to add extra items on the end?

Which is best to minimize access time for accessing the entire list from first to last?

On this basis (or any others), which data structure would be preferable?

For reference, I am using Delphi 2009.


Dimitry in a comment said:

Describe your task and data access pattern, then it will be possible to give you an exact answer

Okay. I've got a genealogy program with lots of data.

For each person I have a number of events and attributes. I am storing them as short text strings but there are many of them for each person, ranging from 0 to a few hundred. And I've got thousands of people. I don't need random access to them. I only need them associated as a number of strings in a known order attached to each person. This is my case of thousands of "small lists". They take time to load and use memory, and take time to access if I need them all (e.g. to export the entire generated report).

Then I have a few larger lists, e.g. all the names of the sections of my "virtual" treeview, which can have hundreds of thousands of names. Again I only need a list that I can access by index. These are stored separately from the treeview for efficiency, and the treeview retrieves them only as needed. This takes a while to load and is very expensive memory-wise for my program. But I don't have to worry about access time, because only a few are accessed at a time.

Hopefully this gives you an idea of what I'm trying to accomplish.

p.s. I've posted a lot of questions about optimizing Delphi here at StackOverflow. My program reads 25 MB files with 100,000 people and creates data structures and a report and treeview for them in 8 seconds but uses 175 MB of RAM to do so. I'm working to reduce that because I'm aiming to load files with several million people in 32-bit Windows.


I've just found some excellent suggestions for optimizing a TList at this StackOverflow question: Is there a faster TList implementation?

like image 752
lkessler Avatar asked Apr 21 '10 05:04

lkessler


2 Answers

Unless you have special needs, a TStringList is hard to beat because it provides the TStrings interface that many components can use directly. With TStringList.Sorted := True, binary search will be used which means that search will be very quick. You also get object mapping for free, each item can also be associated with a pointer, and you get all the existing methods for marshalling, stream interfaces, comma-text, delimited-text, and so on.

On the other hand, for special needs purposes, if you need to do many inserts and deletions, then something more approaching a linked list would be better. But then search becomes slower, and it is a rare collection of strings indeed that never needs searching. In such situations, some type of hash is often used where a hash is created out of, say, the first 2 bytes of a string (preallocate an array with length 65536, and the first 2 bytes of a string is converted directly into a hash index within that range), and then at that hash location, a linked list is stored with each item key consisting of the remaining bytes in the strings (to save space---the hash index already contains the first two bytes). Then, the initial hash lookup is O(1), and the subsequent insertions and deletions are linked-list-fast. This is a trade-off that can be manipulated, and the levers should be clear.

like image 100
Caleb Hattingh Avatar answered Sep 19 '22 12:09

Caleb Hattingh


  1. A TStringList. Pros: has extended functionality, allowing to dynamically grow, sort, save, load, search, etc. Cons: on large amount of access to the items by the index, Strings[Index] is introducing sensible performance lost (few percents), comparing to access to an array, memory overhead for each item cell.

  2. A Dynamic Array of strings. Pros: combines ability to dynamically grow, as a TStrings, with the fastest access by the index, minimal memory usage from others. Cons: limited standard "string list" functionality.

  3. A Linked List of strings (singly linked). Pros: the linear speed of addition of an item to the list end. Cons: slowest access by the index and searching, limited standard "string list" functionality, memory overhead for "next item" pointer, spead overhead for each item memory allocation.

  4. TList< string >. As above.

  5. TStringBuilder. I does not have a good idea, how to use TStringBuilder as a storage for multiple strings.

Actually, there are much more approaches:

  • linked list of dynamic arrays
  • hash tables
  • databases
  • binary trees
  • etc

The best approach will depend on the task.

Which is best for small lists (under 10 items)?

Anyone, may be even static array with total items count variable.

Which is best for large lists (over 1000 items)? Which is best for huge lists (over 1,000,000 items)?

For large lists I will choose: - dynamic array, if I need a lot of access by the index or search for specific item - hash table, if I need to search by the key - linked list of dynamic arrays, if I need many item appends and no access by the index

Which is best to minimize memory use?

dynamic array will eat less memory. But the question is not about overhead, but about on which number of items this overhead become sensible. And then how to properly handle this number of items.

Which is best to minimize loading time to add extra items on the end?

dynamic array may dynamically grow, but on really large number of items, memory manager may not found a continous memory area. While linked list will work until there is a memory for at least a cell, but for cost of memory allocation for each item. The mixed approach - linked list of dynamic arrays should work.

Which is best to minimize access time for accessing the entire list from first to last?

dynamic array.

On this basis (or any others), which data structure would be preferable?

For which task ?

like image 36
da-soft Avatar answered Sep 21 '22 12:09

da-soft