Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is a good on-disk "set" implementation for Python?

I'm working on a program in Python that needs to store a persistent "set" data structure containing many fixed-size hash values (SHA256, but that's not important). The critical operations are insert and lookup. Delete is not needed for regular operation. The set will grow over time and eventually may not all fit in memory.

I have considered:

  • a set stored on disk using pickle (slow [several seconds] to write new file to disk, eventually won't fit in memory)
  • a SQLite database (additional dependency not available by default)
  • custom disk-based balanced tree structure, such as B-tree or similar

Ideally, there would be a built-in Python module that provides something that can support these operations. What's a good option here?

After I composed this I found Fast disk-based hashtables? which has some good ideas. I like the mmap/bucket accepted answer there.

(This is for a rewrite of shaback if you're curious.)

like image 586
Greg Hewgill Avatar asked May 19 '11 10:05

Greg Hewgill


1 Answers

Another option is to use shelve, i know it's the same as pickle (under the hood) but i think it's a good option (that i didn't see in your list of options :-)) or maybe if you don't mind using a third party lib you can take a look at shove (it's like a shelve++).

like image 84
mouad Avatar answered Oct 08 '22 04:10

mouad