I need to store a large list of numbers in memory. I will then need to check for membership. Arrays are better than lists for memory efficiency. Sets are better than lists for membership checking. I need both! So my questions are:
1) How much more memory efficient are arrays than sets? (For the converse, see my results below). 2) Is there a data structure which strikes a better balance between sets and arrays? Something like a set with a signed integer type? Or some numpy construct?
I checked out the membership timing difference with the script below. (I know timeit is better but the variance is low enough for time to be fine):
import array
import time
class TimerContext:
def __enter__(self):
self.t0 = time.time()
def __exit__(self, *args, **kwargs):
print(time.time()-self.t0)
SIZE = 1000000
l = list([i for i in range(SIZE)])
a = array.array('I', l)
s = set(l)
print(type(l))
print(type(a))
print(type(s))
with TimerContext():
x = 99999 in l
with TimerContext():
x = 99999 in a
with TimerContext():
x = 99999 in s
Results:
<class 'list'>
<class 'array.array'>
<class 'set'>
0.0012176036834716797
0.0024595260620117188
1.430511474609375e-06
So sets are a LOT faster for membership checking (please note the scientific notation). So if their memory footprint isn't that different to the array, I'll prefer to use a set. But I don't know how to check the memory footprint.
I should also add that there are a lot of questions comparing sets and lists. But I didn't see any good answers comparing arrays and sets.
Difference Between Array and SetArray is faster than set in terms of initialization. Set is slower than an array in terms of initialization because it uses a hash process. The array allows to store duplicate elements in it. Set doesn't allow to store duplicate elements in it.
Generally the lists are faster than sets. But in the case of searching for an element in a collection, sets are faster because sets have been implemented using hash tables. So basically Python does not have to search the full set, which means that the time complexity in average is O(1).
Lists are slightly faster than sets when you just want to iterate over the values. Sets, however, are significantly faster than lists if you want to check if an item is contained within it. They can only contain unique items though.
The tuples refer to the collections of various objects of Python separated by commas between them. The sets are an unordered collection of data types. These are mutable, iterable, and do not consist of any duplicate elements.
If it's possible in your case, bisect
performance comes close to set
for membership checks (both with list and array). See results below
import array
from bisect import bisect
import sys
import time
class TimerContext:
def __enter__(self):
self.t0 = time.time()
def __exit__(self, *args, **kwargs):
print(time.time() - self.t0)
def get_size_in_megabytes(iterable):
return round(sys.getsizeof(iterable) / (1024 ** 2), 2)
SIZE = 1000000
l = list([i for i in range(SIZE)])
a = array.array("I", l)
s = set(l)
print(type(l), get_size_in_megabytes(l))
print(type(a), get_size_in_megabytes(a))
print(type(s), get_size_in_megabytes(s))
with TimerContext():
x = 99999 in l
with TimerContext():
x = 99999 in a
with TimerContext():
x = 99999 in s
print("list bisect")
with TimerContext():
bisect(l, 99999)
print("array bisect")
with TimerContext():
bisect(a, 99999)
Results:
<class 'list'> 8.58
<class 'array.array'> 3.81
<class 'set'> 32.0
0.0024390220642089844
0.0053005218505859375
3.814697265625e-06
list bisect
9.298324584960938e-06
array bisect
6.198883056640625e-06
Credits for sys.getsizeof
ussage to @CristiFati.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With