UPDATE 1
Both sets contain strings of maximum length 20, made up only of characters from 'abcdefghijklmnopqrstuvwxyz'.
UPDATE 2
I constructed the sets by reading two files from disk using a library called ujson (similar to simplejson) and then converting the returned lists into sets.
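Roughly, the loading step looks like this (the file names are placeholders, and I am assuming each file holds a single JSON array of strings):

import ujson  # third-party JSON library with the same basic load/loads calls as json

# Hypothetical file names; each file is assumed to contain one JSON array of strings.
with open("first.json") as f:
    first_100_million_set = set(ujson.load(f))

with open("second.json") as f:
    second_100_million_set = set(ujson.load(f))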
I am trying to take the difference of 2 sets containing 100 million elements each.
This code executes in 2 minutes:
temp = set()                        # O(1)
for i in first_100_million_set:     # O(N)
    temp.add(i)                     # O(1)
This code executes in 6 hours:
temp = set()                            # O(1)
for i in first_100_million_set:         # O(N)
    if i in second_100_million_set:     # O(1)
        temp.add(i)                     # O(1)
All I did was add a membership check, which, if I am not mistaken, is done in O(1). Where is this massive slowdown coming from?
I know about set(a) - set(b); it does practically the same thing as my second block of code and also takes 6 hours to complete. I only wrote out the whole procedure to demonstrate my point of confusion.
Do you think there is a better solution for what I am trying to do?
When talking about 100 million element sets, I'd worry about data being evicted from RAM (going to swap/pagefile). A 100M element set on Python 3.5 built for a 64 bit processor (which you're using, because you couldn't even create such a set in a 32 bit build of Python) uses 4 GB of memory just for the set overhead (ignoring the memory used by the objects it contains).
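If you want to sanity-check that overhead figure, sys.getsizeof reports just the set's own bookkeeping (not the stored objects). A rough sketch, with the understanding that the exact per-element number wobbles depending on where the hash table last resized:

import sys

s = set()
for n in (10**4, 10**5, 10**6):
    while len(s) < n:
        s.add(len(s))
    # getsizeof counts only the set's internal table, not the objects it holds
    print(n, "elements ->", sys.getsizeof(s) / n, "bytes of set overhead per element")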
Your code that creates a new set without membership testing the second set accesses this memory sequentially, so the OS can predict the access patterns and is likely pulling data into the cache before you need it, even if most of the set is paged out. The only random access occurs in building the new set itself (but conveniently, the objects being inserted are already in cache, because you pulled them from the original set). So your randomly accessed memory grows from nothing to maybe 4 GB (plus the size of the contained objects) worth of memory that must not be paged out without causing performance problems.
In your second case, the set being membership tested is accessed randomly on every test, and it has to load every object in the bucket collision chain with a matching hash (admittedly, with good hash generation, there shouldn't be too many of these matches). But it means the size of your randomly accessed memory went from growing from 0 to 4 GB to growing from 4 GB to as much as 8 GB (depending on how much overlap exists between the sets; again, ignoring the access to the stored objects themselves). I wouldn't be surprised if this pushed you from performing mostly RAM accesses to incurring page faults requiring reads from the page file, which is several orders of magnitude slower than RAM access. Not coincidentally, that code is taking a few orders of magnitude longer to execute.
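If you want to confirm the paging theory rather than guess, one option is to watch the process's resident memory while the loop runs. A minimal sketch using the third-party psutil package (an extra dependency, not something your code already uses):

import os
import psutil  # third-party; pip install psutil

proc = psutil.Process(os.getpid())
# rss is the part of the process actually resident in RAM right now; if it stays
# far below the total size of your data, the rest is living in swap/pagefile.
print(proc.memory_info().rss / 2**30, "GiB resident")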
For the record, the set overhead is likely to be a fraction of the cost of the objects stored. The smallest useful objects in Python are floats (24 bytes apiece on Python 3.5 x64), though they're poor choices for sets due to issues with exact equality testing. ints that require less than 30 bits of magnitude are conceivably useful, and eat 28 bytes apiece (add 4 bytes for every full 30 bits required to store the value). So a 100M element set might "only" use 4 GB for the data structure itself, but the values are another 2.6 GB or so minimum; if they're not Python built-in types, user-defined objects, even using __slots__, would at least double this (quintuple it if not using __slots__), before they even pay the RAM for their attributes. I've got 12 GB of RAM on my machine, and your second use case would cause massive page thrashing on it, while your first case would run just fine for a set initialized with range(100000000) (though it would cause most of the other processes to page out; Python with the two sets plus the ints shared between them would use ~11 GB).
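If you want to verify those per-object sizes, sys.getsizeof reports them directly; the values in the comments are what a typical 64-bit CPython 3.5 build shows:

import sys

print(sys.getsizeof(0.0))      # float: 24 bytes on a 64-bit build
print(sys.getsizeof(1))        # int needing under 30 bits: 28 bytes
print(sys.getsizeof(2**60))    # larger int: +4 bytes for each extra 30-bit digit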
Update: Your data (strings of 1-20 ASCII characters) would use 50-69 bytes each on Python 3.5 x64 (probably a little more including allocator overhead), or 4.65-6.43 GB per set (assuming none of the strings are shared, that's 9-13 GB for the raw data). Add the three sets involved, and you're looking at up to 25 GB of RAM (you don't pay again for the members of the third set, since they're shared with the first set). I wouldn't try to run your code on any machine with less than 32 GB of RAM.
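The per-string figures are easy to double-check with sys.getsizeof; on a 64-bit CPython 3.5 build, a pure-ASCII str costs roughly 49 bytes of overhead plus one byte per character:

import sys

print(sys.getsizeof('a'))       # 1-char ASCII string: ~50 bytes
print(sys.getsizeof('a' * 20))  # 20-char ASCII string: ~69 bytes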
As for "is there a better solution?" it depends on what you need. If you don't actually need the original set
s, just the resulting difference, streaming your data would help. For example:
with open(file1) as f:
    # Assume one string per line, with newlines separating
    myset = set(map(str.rstrip, f))

with open(file2) as f:
    myset.difference_update(map(str.rstrip, f))
That would peak at about 10-11 GB of memory, then drop as elements from the second input were removed, leaving you with just the difference set and nothing else. Other options include using sorted lists of data, which would reduce the overhead from 4 GB per set to ~850 MB per list, then iterating them both in parallel (but not simultaneously; zip is no good here) to find the elements that exist in the first list but not the second, removing some of the random access costs too.
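Here is a minimal sketch of that sorted-list approach (the function name is mine; it assumes both inputs are already sorted and duplicate-free, which holds if they started life as sets):

def sorted_difference(first_sorted, second_sorted):
    """Yield items in first_sorted that are not in second_sorted.

    Both inputs must be sorted and duplicate-free; the two sequences are
    walked in lockstep instead of doing random-access hash lookups.
    """
    it = iter(second_sorted)
    current = next(it, None)
    for item in first_sorted:
        # Advance the second sequence until it catches up with item.
        while current is not None and current < item:
            current = next(it, None)
        if current is None or current != item:
            yield item

# Example: list(sorted_difference(['a', 'b', 'd'], ['b', 'c'])) -> ['a', 'd']

Because both sequences are read front to back, the access pattern stays sequential, which is exactly what keeps the paging cost down.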