I need a faster way to store and access around 3GB of k:v
pairs. Where k
is a string or an integer and v
is an np.array()
that can be of different shapes.
Is there any object that is faster than the standard python dict in storing and accessing such a table? For example, a pandas.DataFrame
?
As far I have understood, python dict is a quite fast implementation of a hashtable. Is there anything better than that for my specific case?
Lookups are faster in dictionaries because Python implements them using hash tables. If we explain the difference by Big O concepts, dictionaries have constant time complexity, O(1) while lists have linear time complexity, O(n).
defaultdict. defaultdict handles our previous error by building on top of dictionaries. It does not give a KeyError like the regular dictionaries. However, whenever you try to access or “assign to” a key that is not there, it will create that key and give it a default value that was already specified .
Why does the second example using dictionaries run incredibly fast, faster than the first example with lists. the example with dictionaries runs almost thirty-fold faster! I tested these 2 code using n=1000000, and the first code run in 1032 seconds and the second one run in just 3.3 second,,, amazin'!
With CPython 2.7, using dict() to create dictionaries takes up to 6 times longer and involves more memory allocation operations than the literal syntax. Use {} to create dictionaries, especially if you are pre-populating them, unless the literal syntax does not work for your case.
No, there is nothing faster than a dictionary for this task and that’s because the complexity of its indexing (getting and setting item) and even membership checking is O(1) in average. (check the complexity of the rest of functionalities on Python doc https://wiki.python.org/moin/TimeComplexity )
Once you saved your items in a dictionary, you can have access to them in constant time which means that it's unlikely for your performance problem to have anything to do with dictionary indexing. That being said, you still might be able to make this process slightly faster by making some changes in your objects and their types that may result in some optimizations at under the hood operations.
e.g. If your strings (keys) are not very large you can intern the lookup key and your dictionary's keys. Interning is caching the objects in memory --or as in Python, table of "interned" strings-- rather than creating them as a separate object.
Python has provided an intern()
function within the sys
module that you can use for this.
Enter string in the table of “interned” strings and return the interned string – which is string itself or a copy. Interning strings is useful to gain a little performance on dictionary lookup...
also ...
If the keys in a dictionary are interned and the lookup key is interned, the key comparisons (after hashing) can be done by a pointer comparison instead of comparing the string values themselves which in consequence reduces the access time to the object.
Here is an example:
In [49]: d = {'mystr{}'.format(i): i for i in range(30)}
In [50]: %timeit d['mystr25']
10000000 loops, best of 3: 46.9 ns per loop
In [51]: d = {sys.intern('mystr{}'.format(i)): i for i in range(30)}
In [52]: %timeit d['mystr25']
10000000 loops, best of 3: 38.8 ns per loop
No, I don't think there is anything faster than dict
. The time complexity of its index checking is O(1)
.
-------------------------------------------------------
Operation | Average Case | Amortized Worst Case |
-------------------------------------------------------
Copy[2] | O(n) | O(n) |
Get Item | O(1) | O(n) |
Set Item[1] | O(1) | O(n) |
Delete Item | O(1) | O(n) |
Iteration[2] | O(n) | O(n) |
-------------------------------------------------------
PS https://wiki.python.org/moin/TimeComplexity
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With