I have a problem: I need space-efficient lookup of file-system data based of file path prefix. Prefix searching of sorted text, in other words. Use a trie, you say, and I thought the same thing. Trouble is, tries are not space-efficient enough, not without other tricks.
I have a fair amount of data:
I don't want to be eating anywhere close to 450M in memory. At this point I'd be happy to be using somewhere around 100M, since there's lots of redundancy in the form of prefixes.
I'm using C# for this job, and a straightforward implementation of a trie will still require one leaf node for every line in the file. Given that every leaf node will require some kind of reference to the final chunk of text (32 bits, say an index into an array of string data to minimize string duplication), and CLR object overhead is 8 bytes (verified using windbg / SOS), I'll be spending >96,000,000 bytes in structural overhead with no text storage at all.
Let's look at some of the statistical attributes of the data. When stuffed in a trie:
Excess rates of leaf creation is about 15%, excess interior node creation is 22% - by excess creation, I mean leaves and interior nodes created during trie construction but not in the final trie as a proportion of the final number of nodes of each type.
Here's a heap analysis from SOS, indicating where the most memory is getting used:
[MT ]--[Count]----[ Size]-[Class ]
03563150 11 1584 System.Collections.Hashtable+bucket[]
03561630 24 4636 System.Char[]
03563470 8 6000 System.Byte[]
00193558 425 74788 Free
00984ac8 14457 462624 MiniList`1+<GetEnumerator>d__0[[StringTrie+Node]]
03562b9c 6 11573372 System.Int32[]
*009835a0 1456066 23297056 StringTrie+InteriorNode
035576dc 1 46292000 Dictionary`2+Entry[[String],[Int32]][]
*035341d0 1456085 69730164 System.Object[]
*03560a00 1747257 80435032 System.String
*00983a54 8052746 96632952 StringTrie+LeafNode
The Dictionary<string,int>
is being used to map string chunks to indexes into a List<string>
, and can be discarded after trie construction, though GC doesn't seem to be removing it (a couple of explicit collections were done before this dump) - !gcroot
in SOS doesn't indicate any roots, but I anticipate that a later GC would free it.
MiniList<T>
is a replacement for List<T>
using a precisely-sized (i.e. linear growth, O(n^2)
addition performance) T[]
to avoid space wastage; it's a value type and is used by InteriorNode
to track children. This T[]
is added to the System.Object[]
pile.
So, if I tot up the "interesting" items (marked with *
), I get about 270M, which is better than raw text on disk, but still not close enough to my goal. I figured that .NET object overhead was too much, and created a new "slim" trie, using just value-type arrays to store data:
class SlimTrie
{
byte[] _stringData; // UTF8-encoded, 7-bit-encoded-length prefixed string data
// indexed by _interiorChildIndex[n].._interiorChildIndex[n]+_interiorChildCount[n]
// Indexes interior_node_index if negative (bitwise complement),
// leaf_node_group if positive.
int[] _interiorChildren;
// The interior_node_index group - all arrays use same index.
byte[] _interiorChildCount;
int[] _interiorChildIndex; // indexes _interiorChildren
int[] _interiorChunk; // indexes _stringData
// The leaf_node_index group.
int[] _leafNodes; // indexes _stringData
// ...
}
This structure has brought down the amount of data to 139M, and is still an efficiently traversable trie for read-only operations. And because it's so simple, I can trivially save it to disk and restore it to avoid the cost of recreating the trie every time.
So, any suggestions for more efficient structures for prefix search than trie? Alternative approaches I should consider?
Tries is a tree that stores strings. The maximum number of children of a node is equal to the size of the alphabet. Trie supports search, insert and delete operations in O(L) time where L is the length of the key. Hashing:- In hashing, we convert the key to a small value and the value is used to index data.
In computer science, a trie, also called digital tree or prefix tree, is a type of k-ary search tree, a tree data structure used for locating specific keys from within a set. These keys are most often strings, with links between nodes defined not by the entire key, but by individual characters.
A Trie is a special data structure used to store strings that can be visualized like a graph. It consists of nodes and edges. Each node consists of at max 26 children and edges connect each parent node to its children.
Trie DB is the persistent storage. Two options are available to store the data: Document store: Since a new trie is built weekly, we can periodically take a snapshot of it, serialize it, and store the serialized data in the database. Document stores like MongoDB [4] are good fits for serialized data.
Since there are only 1.1 million chunks, you can index a chunk using 24 bits instead of 32 bits and save space there.
You could also compress the chunks. Perhaps Huffman coding is a good choice. I would also try the following strategy: instead of using a character as a symbol to encode, you should encode character transitions. So instead of looking at the probability of a character appearing, look at the probability of the transition in a Markov chain where the state is the current character.
You can find a scientific paper connected to your problem here (citation of the authors: "Experiments show that our index supports fast queries within a space occupancy that is close to the one achievable by compressing the string dictionary via gzip, bzip or ppmdi." - but unfortunately the paper is payment only). I'm not sure how difficult these ideas are to implement. The authors of this paper have a website where you can find also implementations (under "Index Collection") of various compressed index algorithms.
If you want to go on with your approach, make sure to check out the websites about Crit-bit trees and Radix tree.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With