I would like a data structure that stores many pieces of low-entropy data that are often similar to each other. I want to store them efficiently (compressed somehow) and retrieve them by index or by match. Quick retrieval is more important than compression, but storing them uncompressed is not an option.
The best example I can think of is storing a billion written sentences taken from volumes of texts (in a compressed form on disk).
dict:
1: 'The quick brown fox jumps over the lazy dog.'
2: 'The quick green frog jumps over the lazy fox.'
3: 'The quick brown fox jumps over the lazy frog.'
If two sentences are the same they should have the same index.
I want to retrieve them by either index or wildcard match (regex would be nice too, but is not necessary), i.e.:
dict.get(1) => 'The quick brown fox jumps over the lazy dog.'
dict.match('The quick brown *') => [1, 3]
I could compress each sentence, but that neglects the fact that many entries are similar.
I could sort them and store the diffs, but that makes adding and removing elements very difficult.
It should support unicode.
I'm sure there is some tree structure out there that does this.
Bonus points if it has a python wrapper.
This https://hkn.eecs.berkeley.edu/~dyoo/python/suffix_trees/ looks very close, but it hasn't seen any activity since 2002/Python 2.2 and I could not get it to run. If there are newer/better alternatives to check out, I'd love to hear about them.
I am including the bioinformatics tag because I understand that suffix_trees and similar data structures are used there.
As you already pointed out, a suffix tree or a radix tree is probably the way to go. I'd suggest:
1. Creating a radix tree, storing the ids in the leaves. Check the links in this answer for a start, but I believe you'll have to fine-tune whatever you find to fit your needs;
2. Creating a dict mapping ids to paths in the tree. This will allow you to quickly retrieve sentences by id (find the path, follow it to rebuild the sentence). Note that this will make inserts and deletes a bit costly: every time a non-leaf node is changed, every descendant will need to have its path updated in the dict;
2.1. An alternative (in case the paths end up too long) is to have each node store a reference to its parent, so the dict only needs to refer to the leaf node (see the sketch after this list). I believe most implementations out there don't do this, since the main goal of tries is to speed up lookup, not to compress the text itself.
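To make 2.1 concrete, here's a minimal sketch of that layout; the Node class and the names are purely illustrative, not from any particular library:

class Node:
    def __init__(self, part, parent=None):
        self.part = part          # substring stored on the edge into this node
        self.parent = parent      # None only for the root
        self.children = {}        # first char of a child's part -> child Node

def add_child(parent, part):
    child = Node(part, parent)
    parent.children[part[0]] = child
    return child

# Build the first sentence of the question by hand:
root  = Node("The quick ")
brown = add_child(root, "brown fox ")
jumps = add_child(brown, "jumps over the lazy ")
dog   = add_child(jumps, "dog.")

leaves_by_id = {1: dog}           # sentence id -> leaf node

def get(sentence_id):
    # Rebuild a sentence by following parent links from its leaf up to the root.
    parts = []
    node = leaves_by_id[sentence_id]
    while node is not None:
        parts.append(node.part)
        node = node.parent
    return "".join(reversed(parts))

print(get(1))  # 'The quick brown fox jumps over the lazy dog.'

The dict stays small (one entry per sentence), and the price is one extra reference per node.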
The wildcard search is a bit tricky, depending on the complexity of your needs. The example provided is simple: follow the nodes for the prefix until the wildcard is found, then return all descendants. In this case, a general trie might be easier to deal with than the more specialized radix tree, but its space requirements are a bit higher.
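For the simple trailing-wildcard case from the question, a plain character trie built from nested dicts is enough to illustrate the idea (a rough sketch; END is just an arbitrary sentinel key I picked):

END = object()          # reserved key marking "a sentence ends here"

def insert(trie, sentence, sentence_id):
    node = trie
    for ch in sentence:
        node = node.setdefault(ch, {})
    node[END] = sentence_id

def match(trie, pattern):
    # Handles the simple case: a literal prefix followed by a trailing '*'.
    prefix = pattern.rstrip("*")
    node = trie
    for ch in prefix:               # follow the nodes for the prefix...
        if ch not in node:
            return []
        node = node[ch]
    ids, stack = [], [node]         # ...then collect every descendant's id
    while stack:
        current = stack.pop()
        for key, value in current.items():
            if key is END:
                ids.append(value)
            else:
                stack.append(value)
    return sorted(ids)

trie = {}
insert(trie, 'The quick brown fox jumps over the lazy dog.', 1)
insert(trie, 'The quick green frog jumps over the lazy fox.', 2)
insert(trie, 'The quick brown fox jumps over the lazy frog.', 3)
print(match(trie, 'The quick brown *'))   # [1, 3]

The same traversal carries over to a radix tree; you just descend edge labels instead of single characters.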
By the way, you could also optimize your radix tree to take less space by adding some indirection: intern the strings in the nodes, and add extra nodes for long, common substrings. Example:
unique_strings = [  # Not a real array, just a hypothetical "intern table"
    "The quick ",
    "brown fox ",
    "green frog ",
    "jumps over the lazy ",
    "dog.",
    "fox.",
    "frog.",
]
radix_trie = (0, {            # The quick *
    "b": (1, {                # The quick brown fox *
        "j": (3, {            # The quick brown fox jumps over the lazy *
            "d": (4, {}, 1),  # The quick brown fox jumps over the lazy dog.
            "f": (6, {}, 3),  # The quick brown fox jumps over the lazy frog.
        }),
    }),
    "g": (2, {                # The quick green frog *
        "j": (3, {            # The quick green frog jumps over the lazy *
            "f": (5, {}, 2),  # The quick green frog jumps over the lazy fox.
        }),
    }),
})
# The nodes ("b", "j") and ("g", "j") wouldn't occur in a regular radix tree,
# since they have no siblings. Adding them, however, gives a net gain of space.
#
# "jumps over the lazy " is a common substring of
# "brown fox jumps over the lazy " and
# "green frog jumps over the lazy fox."
# which would occur naturally in a radix tree with only the 3 sentences given.
paths = {
    1: ("b", "j", "d"),
    2: ("g", "j", "f"),
    3: ("b", "j", "f"),
}
Of course, for your example this was easy to set up, but finding repeating substrings "in the wild" will be a bit trickier (finding long common substrings in any pair of strings is doable, see the update below). However, assuming that inserts/deletes are an infrequent operation, that shouldn't be a big problem.
Note: I'm suggesting a radix tree instead of a trie because the space requirements for the former are much smaller.
Update: just in case you're planning on tackling the problem by yourself, here's one more hint for compressing your data using a radix tree: according to Wikipedia's article on longest common substring, you can build a generalised suffix tree and use it to find common substrings of two or more strings (it also mentions that this is mostly used in bioinformatics). By creating one for your radix tree's nodes (or at least the ones above a certain size), you can find cases where you want to split them into smaller nodes.
Using your example, the "regular" (without lone children) radix tree would be:
radix_tree = ("The quick ", {
"b":("brown fox jumps over the lazy ", {
"d":("dog.",{},1),
"f":("frog.",{},3),
}),
"g":("green frog jumps over the lazy fox.", {}, 2),
})
which clearly doesn't do a great job of compressing your text. But after creating a suffix tree for the set of words in each node, it becomes clear that " jumps over the lazy " is a good candidate to be interned and reused in two or more nodes (resulting in the example I showed earlier). The space saved will always be (string_length - (1..2)*sizeof_node) * num_nodes (1 for prefixes/suffixes, 2 for the rest), so short strings don't need to be considered at all when doing this optimization.
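If you'd rather not implement a generalised suffix tree right away, even the classic dynamic-programming longest-common-substring routine (quadratic, but fine for short node labels) is enough to surface candidates like the one above; a rough sketch:

def longest_common_substring(a, b):
    # Classic O(len(a) * len(b)) dynamic programming over a rolling row.
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]

print(longest_common_substring(
    "brown fox jumps over the lazy ",
    "green frog jumps over the lazy fox."))
# ' jumps over the lazy '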
Complex, yes, and as Adam Mihalcin noted, a pure Python solution will likely be too costly to store a very large dataset. But in case there's no ready-made solution out there, this is what I'd attempt first...
Your problem sounds exactly like the use case for a trie, which is a tree-based data structure for storing strings by prefix. I haven't used these implementations myself, but a quick search on Google Code reveals open source trie projects here and here and here. The first two are in Java and the third is in C++. I would expect that writing a Python wrapper around C++ would be easier than writing one around Java, since Python has built-in capabilities for interoperability with C.
EDIT
I have checked GitHub, and had a little more success with Python implementations. I have found Python trie implementations here and here and here.
However, if you are really working with a billion sentences, even a very well-written pure Python implementation (as all three of these are) may run out of memory.
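If you really are aiming at a billion sentences, a C/C++-backed trie that already ships with a Python wrapper is probably a safer bet. As one example, something along these lines should be possible with the marisa-trie package (my suggestion, not mentioned above; double-check its exact API, such as Trie, keys and restore_key, against its own documentation):

import marisa_trie

sentences = [
    'The quick brown fox jumps over the lazy dog.',
    'The quick green frog jumps over the lazy fox.',
    'The quick brown fox jumps over the lazy frog.',
]

trie = marisa_trie.Trie(sentences)      # static, memory-efficient trie

def get(index):
    return trie.restore_key(index)      # sentence for a given id

def match(prefix):
    return sorted(trie[key] for key in trie.keys(prefix))

some_id = trie['The quick brown fox jumps over the lazy dog.']
print(get(some_id))                     # the same sentence back
print(match('The quick brown '))        # ids of the two 'brown fox' sentences

Note that the trie assigns its own integer ids (duplicates collapse to the same id, as the question requires), but those ids are only stable for a given build of the trie.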