
Python very large set. How to avoid out of memory exception?

I use a Python set to store unique objects. Every object has __hash__ and __eq__ overridden.

The set contains nearly 200,000 objects and takes about 4 GB of memory. It works fine on a machine with more than 5 GB, but now I need to run the script on a machine that has only 3 GB of RAM available.

I rewrote the script in C#: it reads the same data from the same source and puts it into the CLR analogue of a set (HashSet). Instead of 4 GB it took about 350 MB, while the execution speed was roughly the same (about 40 seconds). But I have to use Python.

Q1: Does Python have any "disk persistent" set, or any other workaround? I imagine it could keep in memory only the "key" data used in the hash/eq methods, while everything else is persisted to disk. Or maybe there are other workarounds in Python for keeping a unique collection of objects that needs more memory than is available on the system.
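For example, a sketch of the kind of workaround I mean, keeping only small string keys in a disk-backed shelve mapping from the standard library (the filename and the id values here are just placeholders, not my real data):

import shelve

# Hypothetical sketch: 'seen_keys.db' is an arbitrary filename; shelve keys must be strings.
seen = shelve.open('seen_keys.db')
try:
    for tmdb_id in [603, 603, 550]:   # stand-in for the ids of the parsed objects
        key = str(tmdb_id)
        if key not in seen:
            seen[key] = True          # only the small key is tracked, not the whole object
            # ... write the full object to the database here ...
finally:
    seen.close()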

Q2: A less practical question: why does a Python set take so much more memory?

I use standard Python 2.7.3 on 64-bit Ubuntu 12.10.

Thank you.

Update 1: What the script does:

  1. Read a lot of semi-structured JSON documents (each JSON document consists of a serialized object with a collection of aggregated objects related to it).

  2. Parse each JSON document to retrieve the main object and the objects from the aggregated collections. Every parsed object is stored in a set, which is used to keep unique objects only. At first I used a database, but a unique constraint in the database works 100-1000x slower. Every JSON document is parsed into 1-8 different object types, and each object type is stored in its own set so that only unique objects are kept in memory.

  3. All data stored in the sets is saved to a relational database with unique constraints. Each set is stored in a separate database table.

The whole idea of the script is to take unstructured data, remove duplicates from the aggregated object collections in the JSON documents, and store the structured data in a relational database.

Update 2:

To delnan: I commented out all the lines of code that add objects to the different sets, keeping everything else (getting data, parsing, iterating) the same - the script took 4 GB less memory.

That means the memory is consumed when those 200K objects are added to the sets. Each object is simple movie data from TMDB - an ID, a list of genres, a list of actors, directors, a lot of other movie details, and possibly a large movie description from Wikipedia.
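For illustration, a heavily simplified stand-in for such an object (the real class has many more fields; I'm assuming here that uniqueness is keyed on the TMDB id, purely as a sketch):

class Movie(object):
    def __init__(self, tmdb_id, title, genres):
        self.tmdb_id = tmdb_id
        self.title = title
        self.genres = genres

    def __hash__(self):
        return hash(self.tmdb_id)   # uniqueness keyed on the TMDB id

    def __eq__(self, other):
        return isinstance(other, Movie) and self.tmdb_id == other.tmdb_id

movies = set()
movies.add(Movie(603, 'The Matrix', ['Action', 'Sci-Fi']))
movies.add(Movie(603, 'The Matrix', ['Action', 'Sci-Fi']))   # duplicate, ignored by the set
assert len(movies) == 1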

Asked Mar 21 '13 by Zelid




3 Answers

The best approach would probably be to make the objects that you are storing in the set smaller. If they contain unnecessary fields, remove them.

To reduce the general object overhead you could also use __slots__ to declare the used fields:

class Person(object):
    # Only these attributes are allowed; instances carry no per-instance __dict__
    __slots__ = ['name', 'age']

    def __init__(self):
        self.name = 'jack'
        self.age = 99
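As a rough sanity check (exact numbers vary by Python build), the saving comes from dropping the per-instance __dict__; something like this, with PersonDict as a hypothetical slot-less version of the same class:

import sys

class PersonDict(object):   # same fields as Person above, but without __slots__
    def __init__(self):
        self.name = 'jack'
        self.age = 99

p = PersonDict()
q = Person()   # the __slots__ version from above

print(sys.getsizeof(p) + sys.getsizeof(p.__dict__))   # instance plus its attribute dict
print(sys.getsizeof(q))                               # slotted instance, no __dict__ at all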
Answered by sth


Sets indeed use a lot of memory, but lists don't.

>>> from sys import getsizeof
>>> a = range(100)
>>> b = set(a)
>>> getsizeof(a)
872
>>> getsizeof(b)
8424
>>>

If the only reason you use a set is to prevent duplicates, I would advise you to use a list instead. You can prevent duplicates by testing whether an object is already in the list before adding it. It might be slower than the built-in mechanics of sets, but it will certainly use a lot less memory.
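A minimal sketch of that idea (the membership test does a linear scan using __eq__, so it is O(n) per insert and will be slow for very large collections):

unique = []
for obj in [3, 1, 3, 2, 1]:   # stand-in for the stream of parsed objects
    if obj not in unique:     # linear scan over the list via __eq__
        unique.append(obj)

print(unique)   # [3, 1, 2]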

Answered by Fury


Try using __slots__ to reduce your memory usage.

When I last had this problem with lots and lots of objects, using __slots__ reduced the memory usage to about 1/3.

Here is a SO question about __slots__ you might find interesting.

Answered by Nick Craig-Wood