I am trying to store a large list of strings in a concise manner so that they can be very quickly analyzed/searched through.
A directed acyclic word graph (DAWG) suits this purpose wonderfully. However, I do not have a list of the strings to include in the first place, so it must be incrementally buildable. Additionally, when I search through it for a string, I need to bring back data associated with the result (not just a boolean saying if it was present).
I have found information on a modification of the DAWG for string data tracking here: http://www.pathcom.com/~vadco/adtdawg.html It looks extremely, extremely complex and I am not sure I am capable of writing it.
I have also found a few research papers describing incremental building algorithms, though I've found that research papers in general are not very helpful.
I don't think I am advanced enough to be able to combine both of these algorithms myself. Is there documentation of an algorithm already that features these, or an alternative algorithm with good memory use & speed?
I wrote the ADTDAWG web page. Adding words after construction is not an option. The structure is nothing more than 4 arrays of unsigned integer types. It was designed to be immutable for total CPU cache inclusion, and minimal multi-thread access complexity.
The structure is an automaton that forms a minimal and perfect hash function. It was built for speed while traversing recursively using an explicit stack.
As published, it supports up to 18 characters. Including all 26 English chars will require further augmentation.
My advice is to use a standard Trie, with an array index stored in each node. Ya, it is going to seem infantile, but each END_OF_WORD node represents only one word. The ADTDAWG is a solution to each END_OF_WORD node in a traditional DAWG representing many, many words.
Minimal and perfect hash tables are not the sort of thing that you can just put together on the fly.
I am looking for something else to work on, or a job, so contact me, and I'll do what I can. For now, all I can say is that it is unrealistic to use heavy optimization on a structure that is subject to being changed frequently.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With