Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python: Set with only existence check?

I have a set of lots of big long strings that I want to do existence lookups for. I don't need the whole string ever to be saved. As far as I can tell, the set() actually stored the string which is eating up a lot of my memory.

Does such a data structure exist?

done = hash_only_set()
while len(queue) > 0 :
   item = queue.pop()
   if item not in done :
      process(item)
      done.add(item)

(My queue is constantly being filled by other threads so I have no way of dedupping it at the start).

like image 444
Paul Tarjan Avatar asked Aug 26 '09 09:08

Paul Tarjan


1 Answers

It's certainly possible to keep a set of only hashes:

done = set()
while len(queue) > 0 :
   item = queue.pop()
   h = hash(item)
   if h not in done :
      process(item)
      done.add(h)

Notice that because of hash collisions, there is a chance that you consider an item done even though it isn't.

If you cannot accept this risk, you really need to save the full strings to be able to tell whether you have seen it before. Alternatively: perhaps the processing itself would be able to tell?

Yet alternatively: if you cannot accept to keep the strings in memory, keep them in a database, or create files in a directory with the same name as the string.

like image 166
Martin v. Löwis Avatar answered Nov 16 '22 01:11

Martin v. Löwis