
Operations on Large Python Sets

Tags:

python

set

UPDATE 1

Both sets contain strings of at most 20 characters, drawn only from the letters 'abcdefghijklmnopqrstuvwxyz'.

UPDATE 2

I constructed the sets by reading two files from disk with a library called ujson (similar to simplejson) and then converting the returned lists into sets.
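
Roughly like this (the file names below are just placeholders):

import ujson  # third-party, largely drop-in for the stdlib json module

# Each file holds one big JSON array of strings
with open('first.json') as f:
    first_100_million_set = set(ujson.load(f))

with open('second.json') as f:
    second_100_million_set = set(ujson.load(f))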


I am trying to take the difference of 2 sets containing 100 million elements each.

This code executes in 2 minutes:

temp = set()                              #O(1)
for i in first_100_million_set:           #O(N)
    temp.add(i)                           #O(1)

This code executes in 6 hours:

temp = set()                              #O(1)
for i in first_100_million_set:           #O(N)
    if i in second_100_million_set:       #O(1)
        temp.add(i)                       #O(1)

All I did was add a membership check which, if I am not mistaken, is done in O(1). Where is this massive slowdown coming from?

I know about set(a) - set(b); it does practically exactly what my second block of code does, and it takes 6 hours to complete as well. I just wanted to write out the whole procedure to demonstrate my point of confusion.

Do you think there is a better solution for what I am trying to do?
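
As a sanity check, timing a single lookup with timeit on much smaller sets (small enough to stay comfortably in RAM) stays roughly flat as the set grows, so the O(1) expectation itself looks fine:

import timeit

for n in (10**4, 10**5, 10**6):
    s = set(map(str, range(n)))
    probe = str(n - 1)                      # an element that is present
    total = timeit.timeit(lambda: probe in s, number=100000)
    # total is seconds for 100,000 lookups; total * 10 is microseconds per lookup
    print("{:>9,} elements: {:.3f} microseconds per lookup".format(n, total * 10))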

Asked Aug 24 '16 by RetroCode


1 Answer

When talking about 100 million element sets, I'd worry about data being evicted from RAM (going to swap/pagefile). A 100M element set on Python 3.5 built for a 64 bit processor (which you're using, because you couldn't even create such a set in a 32 bit build of Python) uses 4 GB of memory just for the set overhead (ignoring the memory used by the objects it contains).
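
You can see that container-only cost with sys.getsizeof, which counts the set's internal hash table but not the objects stored in it; on a 64 bit build each slot is 16 bytes and the table size is a power of two, which is where the ~4 GB for 100M elements comes from. A minimal sketch:

import sys

for n in (10**4, 10**5, 10**6):
    s = set(range(n))
    size = sys.getsizeof(s)   # counts the hash table and header, not the elements
    print("{:>9,} elements: {:>11,} bytes (~{:.0f} bytes/element of overhead)".format(
        n, size, size / n))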

Your code that creates a new set without membership-testing the second set accesses this memory sequentially, so the OS can predict the access pattern and is likely pulling data into the cache before you need it, even if most of the set is paged out. The only random access occurs while building the new set, temp (and conveniently, the objects being inserted are already in cache, because you just pulled them from the original set). So you go from no random access to maybe 4 GB (plus the size of the contained objects) worth of memory that is accessed randomly and cannot be paged out without causing performance problems.

In your second case, the set being membership-tested is accessed randomly on every test, and it has to load every object in the bucket collision chain with a matching hash (admittedly, with good hash generation, there shouldn't be too many of these matches). But it means the amount of randomly accessed memory goes from growing from 0 to ~4 GB to growing from ~4 GB to as much as 8 GB (depending on how much overlap exists between the sets; and again, ignoring the access to the stored objects themselves). I wouldn't be surprised if this pushed you from performing mostly RAM accesses to incurring page faults that require reads from the page file, which are several orders of magnitude slower than RAM access. Not coincidentally, that code takes a few orders of magnitude longer to execute.

For the record, the set overhead is likely to be a fraction of the cost of the objects stored. The smallest useful objects in Python are floats (24 bytes apiece on Python 3.5 x64), though they're a poor choice for sets due to issues with exact equality testing. ints that need fewer than 30 bits of magnitude are conceivably useful and eat 28 bytes apiece (add 4 bytes for every additional full 30 bits required to store the value). So a 100M element set might "only" use 4 GB for the data structure itself, but the values are another ~2.6 GB minimum; and if they're not Python built-in types, user-defined objects (even ones using __slots__) would at least double this (quintuple it without __slots__), before they even pay the RAM for their attributes. I've got 12 GB of RAM on my machine, and your second use case would cause massive page thrashing on it, while your first case would run just fine for a set initialized with range(100000000) (though it would cause most of the other processes to page out; Python with the two sets plus the ints shared between them uses ~11 GB).

Update: Your data (strings of 1-20 ASCII characters) would use 50-69 bytes each on Python 3.5 x64 (probably a little more including allocator overhead), or 4.65-6.43 GB per set (assuming none of the strings are shared, that's 9-13 GB for the raw data). Add the three sets involved, and you're looking at up to 25 GB of RAM (you don't pay again for the members of the third set, since they're shared with the first set). I wouldn't try to run your code on any machine with less than 32 GB of RAM.
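
Those per-object figures are easy to sanity check with sys.getsizeof (the numbers below are for CPython 3.5+ on 64 bit builds and may vary slightly by version):

import sys

print(sys.getsizeof(0.0))       # float: 24 bytes
print(sys.getsizeof(1))         # int below 2**30: 28 bytes
print(sys.getsizeof(1 << 30))   # each extra 30-bit chunk adds 4 bytes: 32
print(sys.getsizeof('a'))       # 1-character ASCII str: 50 bytes
print(sys.getsizeof('a' * 20))  # 20-character ASCII str: 69 bytes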

As for "is there a better solution?" it depends on what you need. If you don't actually need the original sets, just the resulting difference, streaming your data would help. For example:

with open(file1) as f:
    # Assume one string per line with newlines separating
    myset = set(map(str.rstrip, f))

with open(file2) as f:
    myset.difference_update(map(str.rstrip, f))

That would peak at about 10-11 GB of memory, then drop as elements from the second input were removed, leaving you with just the difference set and nothing else. Other options include using sorted lists of data, which would reduce the overhead from 4 GB per set to ~850 MB per list, and then iterating them both in parallel (but not simultaneously; zip is no good here) to find the elements that exist in the first list but not the second, removing some of the random access costs too.
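
A rough sketch of that last idea, assuming both inputs can be produced in sorted, duplicate-free order (say, pre-sorted files with one string per line); a merge-style walk finds the elements that are in the first input but not the second without any hashing or random access:

def sorted_difference(sorted_a, sorted_b):
    """Yield elements of sorted_a that are not in sorted_b.

    Both inputs must be sorted ascending and free of duplicates."""
    it_b = iter(sorted_b)
    b = next(it_b, None)
    for a in sorted_a:
        # Advance through sorted_b until it catches up with (or passes) a
        while b is not None and b < a:
            b = next(it_b, None)
        if b is None or a < b:   # no match for a in sorted_b
            yield a

# Hypothetical usage with pre-sorted text files, one string per line:
# with open('first_sorted.txt') as fa, open('second_sorted.txt') as fb:
#     difference = list(sorted_difference(map(str.rstrip, fa), map(str.rstrip, fb)))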

Answered Sep 27 '22 by ShadowRanger