Data structure options for efficiently storing sets of integer pairs on disk?

I have a bunch of code that deals with document clustering. One step involves calculating the similarity (for some unimportant definition of "similar") of every document to every other document in a given corpus, and storing the similarities for later use. The similarities are bucketed, and for purposes of my analysis I don't care what the specific similarity is, just which bucket it's in. For example, if documents 15378 and 3278 are 52% similar, the ordered pair (3278, 15378) gets stored in the [0.5,0.6) bucket. Documents sometimes get added to or removed from the corpus after the initial analysis, so the corresponding pairs get added to or removed from the buckets as needed.
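To make the bucketing concrete, here is a minimal sketch. It assumes ten equal-width buckets over [0, 1] and that a similarity of exactly 1.0 goes into the top bucket; the question states neither, and the function names are made up for illustration.

```python
import math

NUM_BUCKETS = 10  # assumption: ten equal-width buckets over [0, 1]


def ordered_pair(id_a, id_b):
    """Return the pair with the smaller document ID first."""
    return (id_a, id_b) if id_a <= id_b else (id_b, id_a)


def bucket_index(similarity):
    """Map a similarity in [0, 1] to a bucket index.

    Assumption: 1.0 is folded into the top bucket. Values sitting exactly
    on a boundary are subject to ordinary floating-point rounding.
    """
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must be in [0, 1]")
    return min(int(math.floor(similarity * NUM_BUCKETS)), NUM_BUCKETS - 1)


def bucket_label(index):
    """Half-open range for a bucket index, formatted like '[0.5,0.6)'."""
    return "[%.1f,%.1f)" % (index / NUM_BUCKETS, (index + 1) / NUM_BUCKETS)
```

With these definitions, documents 15378 and 3278 at 52% similarity become the pair (3278, 15378) in the bucket labelled [0.5,0.6).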

I'm looking at strategies for storing these lists of ID pairs. We found a SQL database (where most of our other data for this project lives) to be too slow and too large disk-space-wise for our purposes, so at the moment we store each bucket as a compressed list of integers on disk (originally zlib-compressed, but now using lz4 instead for speed). Things I like about this:

Reading and writing are both quite fast
After-the-fact additions to the corpus are fairly straightforward to handle (a bit less so for lz4 than for zlib because lz4 doesn't have a framing mechanism built in, but doable)
At both write and read time, data can be streamed so it doesn't need to be held in memory all at once, which would be prohibitive given the size of our corpora
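For readers who want to picture the current layout, the following is a rough sketch of one way to get appendable, streamable bucket files. It is not the asker's actual code: it uses zlib from the standard library rather than lz4, and the length-prefixed frame layout, the little-endian record format, and the function names are all assumptions.

```python
import struct
import zlib

PAIR = struct.Struct("<II")          # assumed record: two unsigned 32-bit IDs
FRAME_HEADER = struct.Struct("<I")   # assumed frame header: compressed length


def _write_frame(f, packed_pairs):
    """Compress one chunk of packed pairs and write it with a length prefix."""
    payload = zlib.compress(b"".join(packed_pairs))
    f.write(FRAME_HEADER.pack(len(payload)))
    f.write(payload)


def append_pairs(path, pairs, chunk_pairs=65536):
    """Append pairs to a bucket file as independent compressed frames."""
    with open(path, "ab") as f:
        buf = []
        for a, b in pairs:
            buf.append(PAIR.pack(a, b))
            if len(buf) >= chunk_pairs:
                _write_frame(f, buf)
                buf = []
        if buf:
            _write_frame(f, buf)


def iter_pairs(path):
    """Yield pairs frame by frame, so only one chunk is held in memory."""
    with open(path, "rb") as f:
        while True:
            header = f.read(FRAME_HEADER.size)
            if not header:
                return
            (length,) = FRAME_HEADER.unpack(header)
            raw = zlib.decompress(f.read(length))
            for a, b in PAIR.iter_unpack(raw):
                yield a, b
```

Because every frame is compressed on its own, a later addition to the corpus only appends new frames and never touches the existing ones.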

Things that kind of suck:

Deletes are a huge pain, and basically involve streaming through all the buckets and writing out new ones that omit any pairs that contain the ID of a document that's been deleted
I suspect I could still do better both in terms of speed and compactness with a more special-purpose data structure and/or compression strategy
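The delete procedure described above amounts to a streaming filter plus a rewrite. A minimal sketch follows, assuming for simplicity an uncompressed bucket file of fixed-size records; the record layout and function names are hypothetical.

```python
import os
import struct

PAIR = struct.Struct("<II")  # assumed on-disk record: two unsigned 32-bit IDs


def drop_deleted(pairs, deleted_ids):
    """Yield only the pairs that mention no deleted document."""
    for a, b in pairs:
        if a not in deleted_ids and b not in deleted_ids:
            yield a, b


def read_pairs(path, records_per_read=65536):
    """Stream pairs from a bucket file in fixed-size reads."""
    with open(path, "rb") as f:
        while True:
            block = f.read(PAIR.size * records_per_read)
            if not block:
                return
            yield from PAIR.iter_unpack(block)


def rewrite_bucket(path, deleted_ids):
    """Write a filtered copy next to the bucket, then swap it into place."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as out:
        for a, b in drop_deleted(read_pairs(path), deleted_ids):
            out.write(PAIR.pack(a, b))
    os.replace(tmp_path, path)  # replaces the old bucket file with the new one
```

Passing deleted_ids as a set keeps each membership check cheap, but the cost the question complains about remains: every bucket has to be read and rewritten in full.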

So: what kinds of data structures should I be looking at? I suspect that the right answer is some kind of exotic succinct data structure, but this isn't a space I know very well. Also, if it matters: all of the document IDs are unsigned 32-bit ints, and the current code that handles this data is written in C, as Python extensions, so that's probably the general technology family we'll stick with if possible.

456

asked May 07 '13 18:05

Andrew Pendleton

1 Answer

How about using one hash table or B-tree per bucket?

On-disk hash tables are standard. Maybe the BerkeleyDB libraries (available in stock Python 2 via the bsddb module) will work for you, but be advised that since they come with transactions they can be slow and may require some tuning. There are a number of other choices, such as gdbm and tdb, and you should give them all a try. Just make sure you check out the API and initialize them with an appropriate size. Some will not resize automatically, and if you feed them too much data their performance drops sharply.
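As a rough illustration of the one-hash-table-per-bucket idea, here is a sketch built on Python 3's standard dbm module, which picks whichever backend is installed (gdbm, ndbm, or a pure-Python fallback). The key layout, the placeholder value, and the helper names are assumptions, and nothing here has been tuned or benchmarked.

```python
import dbm
import struct

KEY = struct.Struct(">II")  # assumed key: the two document IDs, 8 bytes total
PRESENT = b"1"              # placeholder value; only the key carries information


def open_bucket(path):
    """Open (or create) the on-disk hash table for one bucket."""
    return dbm.open(path, "c")


def _key(a, b):
    """Build the key with the smaller ID first, matching the ordered pairs."""
    return KEY.pack(min(a, b), max(a, b))


def add_pair(db, a, b):
    db[_key(a, b)] = PRESENT


def has_pair(db, a, b):
    return _key(a, b) in db


def remove_pair(db, a, b):
    key = _key(a, b)
    if key in db:
        del db[key]
```

Removing one known pair is then a single keyed delete. Finding every pair that mentions a deleted document would still need a scan over the keys or some secondary index, which this sketch does not cover.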

Anyway, you may want to use something even more low-level, without transactions, if you have a lot of changes.

A pair of 32-bit ints fits in a single 64-bit long, and most databases should accept a long as a key; in fact, many will accept arbitrary byte sequences as keys.
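A small sketch of that packing, with hypothetical function names: the first ID goes in the high 32 bits and the second in the low 32 bits. The big-endian byte form is included because fixed-width big-endian keys compare byte-wise in the same order as the integers they encode, which matters for ordered stores such as B-trees.

```python
import struct

MAX_ID = 0xFFFFFFFF  # largest unsigned 32-bit document ID


def pair_to_long(a, b):
    """Pack two unsigned 32-bit IDs into one unsigned 64-bit integer."""
    if not (0 <= a <= MAX_ID and 0 <= b <= MAX_ID):
        raise ValueError("IDs must fit in an unsigned 32-bit int")
    return (a << 32) | b


def long_to_pair(key):
    """Split a packed 64-bit key back into its two IDs."""
    return key >> 32, key & MAX_ID


def long_to_bytes(key):
    """Encode the key as 8 big-endian bytes for stores that take byte keys."""
    return struct.pack(">Q", key)
```

One caveat: if the first ID is 2^31 or larger, the packed value no longer fits a signed 64-bit column, so a store that only offers signed longs may be better served by the byte form.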

170

answered Sep 23 '22 04:09

Erich Schubert

Tags: python, c, integer, data-structures
