I'm looking for an efficient data structure to represent a very large matrix of integers in Python/Cython, with a focus on element-wise operations.
I'm currently building a model that requires a lot of element-wise operations on a large, highly sparse matrix (ca. 50bn read/writes on a 2MMx500k matrix). Previously I've run experiments on smaller data and used Python with Cython and Numpy arrays, and would ideally like to continue using some parts of the existing infrastructure.
Below are a number of options I've looked at / implemented so far. They may not be fully optimised, but all implementations should be good enough to give a realistic idea of the potential of each approach. I've tested by creating a 2MMx500k matrix, adding 25MM elements and then removing them again (a sketch of this benchmark follows the list); this reflects the kind of operations I would need.
29 minutes: Cython lists to fill scipy.sparse.coo -> sparse.dok
10 minutes: Cython lists to fill scipy.sparse.coo -> sparse.lil
3 minutes: Dict, s.t. A["%d_%d" % (i,j)] contains M[i][j]
3 minutes: Dict, s.t. A[(i,j)] contains M[i][j]
<1 minute: Dict, s.t. A[i*N+j] contains M[i][j]
<1 minute: std::map using Cython
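For concreteness, the benchmark for the dict-based variants looks roughly like the sketch below (a simplified sketch with assumed details; the random coordinates and the constant value 1 are placeholders for my real data and access pattern):

# Rough sketch of the benchmark: fill a sparse "matrix" held in a plain dict,
# then read and remove every element again. The 2MMx500k shape is only nominal,
# since the dict stores nothing for zero entries.
import random

N_ROWS, N_COLS = 2_000_000, 500_000
N_ELEMENTS = 25_000_000

coords = [(random.randrange(N_ROWS), random.randrange(N_COLS))
          for _ in range(N_ELEMENTS)]

A = {}

# "Setup": write the elements
for i, j in coords:
    A[(i, j)] = 1

# "Execution": read every element back and remove it again
for i, j in coords:
    v = A.pop((i, j), 0)   # default guards against duplicate random coordinates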
Hacking a dict together performs best so far, but is still fairly slow. Also it feels like too much of a hack, so I'm assuming there must be a more efficient method out there, particularly considering the dict solution doesn't really use any of the potential Cython could provide. Is there a more Cythonic solution around for this? Google wasn't very helpful unfortunately (or I was lacking the correct keywords to search with).
Any suggestions on how to do this would be greatly appreciated!
Edit 1
The difference between the two dictionary solutions is that the A["%d_%d" % (i,j)] variant is faster for access, whereas the A[(i,j)] variant is faster for setup.
                                                   Setup    Execution
Dict, s.t. A["%d_%d" % (i,j)] contains M[i][j]     180s       30s
Dict, s.t. A[(i,j)] contains M[i][j]                66s      104s
Dict, s.t. A[i*N+j] contains M[i][j]                40s        8s
While the A["%d_%d" % (i,j)] variant is marginally slower in the current test, it would be preferable in the long run: for my actual experiments the setup cost will only increase by a factor of roughly 10x, whereas the execution cost will increase by a factor of roughly 10,000x.
Edit 2
I could speed up the dictionary version further by removing the string operations and instead using a large integer representation, concatenating the two indices by multiplying the first with a suitably large constant to avoid collisions. The key computation is typed using a cdef function:
cdef unsigned long long key(int a, int b): return <unsigned long long>a * 10000000 + b  # 64-bit result; a 32-bit unsigned int would overflow for 2MM rows
It might still be possible to improve speed further by optimising the dict or moving the data structure to C, but this should be fast enough for my purposes. Any other suggestions would still be very welcome, though! I will report back if I find a more efficient solution using std::map or similar data structures.
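For completeness, this is roughly how the typed key function slots into the read/write loop (an illustrative sketch; the run function and the pairs argument are made-up names, and the real loops live inside a larger Cython module):

# Illustrative usage of the typed key() defined above; `pairs` is assumed to be
# an iterable of (row, column) index tuples.
def run(pairs):
    cdef dict A = {}
    cdef int i, j
    for i, j in pairs:
        A[key(i, j)] = 1          # write M[i][j]
    for i, j in pairs:
        v = A.pop(key(i, j), 0)   # read M[i][j] and remove it again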
Edit 3
Following a suggestion from a colleague, I've also implemented the system using std::map via its Cython interface, storing the data in an <int,int> map. The implementation is actually mildly slower than the dict implementation once we have large amounts of data, with the key difference being in access speed:
                 Small data (25MM elements)        Large data (250MM elements)
                 total    setup   read/write       total    setup   read/write
Dict[int keys]    40s      25s       8s             369s     269s      72s
std::map          26s      11s      12s             376s     169s     177s
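For reference, the std::map variant is wired up roughly as in the sketch below (a minimal sketch assuming a C++ build of the Cython module; the SparseMap class and its method names are illustrative, not an existing API, and I've used a 64-bit key type here since the concatenated index exceeds the 32-bit range for a 2MM-row matrix):

# distutils: language = c++
# Minimal sketch of a std::map-backed store using the same concatenated-index key.
from cython.operator cimport dereference as deref
from libcpp.map cimport map as cpp_map

cdef class SparseMap:
    cdef cpp_map[long long, int] data            # key = i * 10000000 + j -> M[i][j]

    cdef long long key(self, int i, int j):
        return <long long>i * 10000000 + j

    def set(self, int i, int j, int value):
        self.data[self.key(i, j)] = value        # std::map::operator[] inserts if missing

    def get(self, int i, int j):
        cdef cpp_map[long long, int].iterator it = self.data.find(self.key(i, j))
        if it == self.data.end():
            return 0                             # treat absent entries as zero
        return deref(it).second

    def remove(self, int i, int j):
        self.data.erase(self.key(i, j))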
If you aren't doing row- or column-based accesses, you can use a dict with a tuple key like A[(i,j)].