Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Processing a list of lists in Cython, with nogil

In Python I have a list of lists as input:

input = [[0,1,2],[0,3,4,5],[0,6]]

In reality, the number of sub-lists is tens of thousands. The length of each sub-list could vary greatly, from zero or one value to many hundreds.

I want to pass the input data as some 2D structure to a Cython module that will process it. I wish to process the data on multiple cores, thus I use prange with nogil=True:

from cython.parallel import prange

cpdef int my_func(long[:,:] arr):
    cdef int i,j
    for i in prange(arr.shape[0], nogil=True):
        for j in range(arr.shape[1]):
                    # Do something
            pass
    return 42

I see the following solutions:

  1. Put the list of lists into a 2D ndarray. But as the length of each sub-list vary greatly, a ndarray is not an ideal data structure
  2. Modify my_func to accept a list of lists. The problem is that part of the code is executed without the GIL and therefore not able to access python objects.

Does anyone have suggestions, preferably with code, on how to solve this problem?

like image 582
matthiash Avatar asked Aug 21 '18 18:08

matthiash


1 Answers

I would probably go for a flattened array, where the starts of the single lists are stored in an auxiliary array, not unsimilar to csr-matrices.

Here is an example how this data structure could be constructed from list of lists (using numpy, but you could also use array.array; it is also not really optimized for speed), just to give you the idea:

import numpy as np
def flatten_list_of_lists(lst_of_lsts):
    N = sum(map(len, lst_of_lsts))  # number of elements in the flattened array   
    starts = np.empty(len(lst_of_lsts)+1, dtype=np.uint64)  # needs place for one sentinel
    values = np.empty(N, dtype=np.int64)

    starts[0], cnt = 0, 0
    for i,lst in enumerate(lst_of_lsts):
        for el in lst:
            values[cnt] = el
            cnt += 1       # update index in the flattened array for the next element
        starts[i+1] = cnt  # remember the start of the next list

    return starts, values

So for your example produce the following result:

#         starts                                 values
(array([0, 3, 7, 9], dtype=uint64), array([0, 1, 2, 0, 3, 4, 5, 0, 6]))

You can see: there are 3 sublists starting at 0, 3 and 7 and being 3 (difference starts[1]-starts[0]), 4 and 2 elements long.

And this is how this data could be consumed:

%%cython
from cython.parallel import prange

cpdef int my_func(unsigned long long[::1] starts, long long[::1] values):
    cdef int i,j
    for i in prange(len(starts)-1, nogil=True):
        for j in range(starts[i], starts[i+1]):
                    # Do something with values[j]
            pass
    return 42
like image 113
ead Avatar answered Nov 18 '22 13:11

ead