I have a bunch of code that deals with document clustering. One step involves calculating the similarity (for some unimportant definition of "similar") of every document to every other document in a given corpus, and storing the similarities for later use. The similarities are bucketed, and I don't care what the specific similarity is for purposes of my analysis, just what bucket it's in. For example, if documents 15378 and 3278 are 52% similar, the ordered pair (3278, 15378) gets stored in the [0.5,0.6) bucket. Documents sometimes get either added or removed from the corpus after initial analysis, so corresponding pairs get added to or removed from the buckets as needed.
I'm looking at strategies for storing these lists of ID pairs. We found a SQL database (where most of our other data for this project lives) to be too slow and too large disk-space-wise for our purposes, so at the moment we store each bucket as a compressed list of integers on disk (originally zlib-compressed, but now using lz4 instead for speed). Things I like about this:
Things that kind of suck:
So: what kinds of data structures should I be looking at? I suspect that the right answer is some kind of exotic succinct data structure, but this isn't a space I know very well. Also, if it matters: all of the document IDs are unsigned 32-bit ints, and the current code that handles this data is written in C, as Python extensions, so that's probably the general technology family we'll stick with if possible.
Explanation: Stack data structure is most suitable to implement redo-undo feature. This is because the stack is implemented with LIFO(last in first out) order which is equivalent to redo-undo feature i.e. the last re-do is undo first.
A list is a data structure in Python that is a mutable, or changeable, ordered sequence of elements. Each element or value that is inside of a list is called an item. Just as strings are defined as characters between quotes, lists are defined by having values between square brackets [ ] .
The basic Python data structures in Python include list, set, tuples, and dictionary. Each of the data structures is unique in its own way. Data structures are “containers” that organize and group data according to type. The data structures differ based on mutability and order.
How about using one hash table or B-tree per bucket?
On-disk hashtables are standard. Maybe the BerkeleyDB libraries (availabe in stock python) will work for you; but be advised that they since they come with transactions they can be slow, and may require some tuning. There are a number of choices: gdbm, tdb that you should all give a try. Just make sure you check out the API and initialize them with appropriate size. Some will not resize automatically, and if you feed them too much data their performance just drops a lot.
Anyway, you may want to use something even more low-level, without transactions, if you have a lot of changes.
A pair of ints is a long - and most databases should accept a long as a key; in fact many will accept arbitrary byte sequences as keys.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With