Efficient structure for element-wise access to a very large sparse matrix (Python/Cython)

I'm looking for an efficient data structure to represent a very large matrix of integers in Python/Cython, with a focus on element-wise operations.

I'm currently building a model that requires a lot of element-wise operations on a large, highly sparse matrix (ca. 50bn reads/writes on a 2MM x 500k matrix). Previously I ran experiments on smaller data using Python with Cython and NumPy arrays, and would ideally like to keep using some parts of the existing infrastructure.

Here are a number of options I've looked at / implemented so far. They may not be fully optimised, but all implementations should be good enough to give a realistic idea of the potential of each approach. I tested by creating a 2MM x 500k matrix, adding 25MM elements and then removing them again. This reflects the kind of operations I would need (the dict key schemes are sketched just after the list).

29 minutes: Cython lists to fill scipy.sparse.coo -> sparse.dok 
10 minutes: Cython lists to fill scipy.sparse.coo -> sparse.lil 
 3 minutes: Dict, s.t. A["%d_%d" % (i,j)] contains M[i][j]
 3 minutes: Dict, s.t. A[(i,j)] contains M[i][j]
<1 minute:  Dict, s.t. A[i*N+j] contains M[i][j]
<1 minute:  <std::map> using Cython
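
To make the dict rows concrete, these are the three key schemes being compared (a minimal illustrative sketch with made-up indices; N is a multiplier that has to be at least the number of columns):

i, j, M_ij = 123, 456, 42
N = 10000000                      # multiplier; must be >= number of columns (500k here)

A1 = {"%d_%d" % (i, j): M_ij}     # string key
A2 = {(i, j): M_ij}               # tuple key
A3 = {i * N + j: M_ij}            # single integer key (fastest of the three)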

Hacking a dict together performs best so far, but it is still fairly slow. It also feels like too much of a hack, so I'm assuming there must be a more efficient method out there, particularly since the dict solution doesn't really use any of the potential Cython offers. Is there a more Cythonic solution for this? Unfortunately Google wasn't very helpful (or I was lacking the right keywords to search for).

Any suggestions on how to do this would be greatly appreciated!

Edit 1

The difference between the two dictionary solutions is that the A["%d_%d" % (i,j)] variant is faster for access, whereas the A[(i,j)] variant is faster for setup.

                                                  Setup    Execution
Dict, s.t. A["%d_%d" % (i,j)] contains M[i][j]    180s      30s
Dict, s.t. A[(i,j)] contains M[i][j]               66s     104s
Dict, s.t. A[i*N+j] contains M[i][j]                40s       8s

While the A["%d_%d" % (i,j)] variant is marginally slower overall in the current test, it would be preferable in the long run: for my actual experiments the setup cost will only grow by a factor of ~10x, whereas the execution cost will grow by a factor of ~10,000x.

Edit 2

I could speed up the dictionary version further by removing the string operations and instead combining the two indices into a single large integer: the first index is multiplied by a suitably large constant (larger than the number of columns) to avoid collisions, and the second index is added. The key function is typed using a cdef:

# 64-bit key: with row indices up to 2MM and a 10MM multiplier, a 32-bit unsigned int would overflow
cdef unsigned long long key(int a, int b): return (<unsigned long long>a) * 10000000 + b
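
In context, the dict access then looks roughly like the following (a minimal sketch rather than the actual model code; the helper names are just illustrative):

# dict_store.pyx - illustrative sketch only
cdef dict A = {}

cdef unsigned long long key(int a, int b):
    return (<unsigned long long>a) * 10000000 + b

cdef void set_element(int i, int j, int value):
    A[key(i, j)] = value

cdef int get_element(int i, int j):
    return A.get(key(i, j), 0)    # unset entries read as 0

cdef void del_element(int i, int j):
    A.pop(key(i, j), None)        # removing a missing entry is a no-op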

It might still be possible to improve speed further by optimising the dict or by moving the data structure into C, but this should be fast enough for my purposes. Any other suggestions would still be very welcome, though! I'll report back if I find a more efficient solution using std::map or similar data structures.

Edit 3

Following a colleague's suggestion, I've also implemented the system using <std::map> via its Cython interface, storing the data in an <int,int> map. The implementation is actually mildly slower than the dict implementation once the amount of data gets large, with the key difference being access speed:

                      Small data (25MM elements)      Large data (250MM elements)
                      total    setup    read/write    total    setup    read/write
Dict[int keys]          40s      25s        8s         369s     269s        72s
<std::map>              26s      11s       12s         376s     169s       177s
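
For reference, the <std::map> variant has roughly this shape (a minimal sketch; I'm assuming the combined 64-bit key from Edit 2 rather than reproducing the exact types, and the helper names are made up):

# distutils: language = c++
# sparse_map.pyx - illustrative sketch of the <std::map> variant
from libcpp.map cimport map as cpp_map

cdef cpp_map[long long, int] M    # key = combined row/column index, value = M[i][j]

cdef inline long long key(int i, int j):
    return (<long long>i) * 10000000 + j

cdef void set_element(int i, int j, int value):
    M[key(i, j)] = value          # operator[] inserts the key if it is missing

cdef int get_element(int i, int j):
    cdef long long k = key(i, j)
    if M.count(k):                # 0 or 1 for std::map
        return M[k]
    return 0                      # unset entries read as 0

cdef void del_element(int i, int j):
    M.erase(key(i, j))            # erasing a missing key is a no-op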
asked Nov 29 '11 by kmh


1 Answer

If you aren't doing row- or column-based accesses, you can use a dict with a tuple key like A[(i,j)].
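
For example (a minimal sketch, with missing entries defaulting to zero so the matrix stays sparse):

A = {}
A[(2, 3)] = 7               # set M[2][3]
value = A.get((2, 3), 0)    # read; absent entries default to 0
A.pop((2, 3), None)         # delete; no error if the entry was never set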

answered Sep 19 '22 by retracile