How to iterate through a large list of lists in Python efficiently?

I have my data as such:

data = {'x':Counter({'a':1,'b':45}), 'y':Counter({'b':1, 'c':212})}

where my labels are the keys of the outer dictionary and the keys of the inner dictionaries are features:

all_features = ['a','b','c']
all_labels = ['x','y']

I need to create a list of lists like this:

[[data[label][feat] for feat in all_features] for label in all_labels]

[out]:

[[1, 45, 0], [0, 1, 212]]

My len(all_features) is ~5,000,000 and len(all_labels) is ~100,000

The ultimate purpose is to create a scipy sparse matrix, e.g.:

from collections import Counter
from scipy.sparse import csc_matrix
import numpy as np


all_features = ['a','b','c']
all_labels = ['x','y']

csc_matrix(np.array([[data[label][feat] for feat in all_features] for label in all_labels]))

but building and looping through such a large list of lists is rather inefficient.

So how can I loop through the large list of lists efficiently?

Is there another way to create the scipy matrix from the data without looping through all features and labels?

alvas asked May 19 '14 21:05

2 Answers

Converting a dictionary of dictionaries into a numpy or scipy array is, as you are experiencing, not too much fun. If you know all_features and all_labels beforehand, you are probably better off using a scipy sparse COO matrix from the start to keep your counts.

Whether that is possible or not, you will want to keep your lists of features and labels in sorted order, to speed up lookups. So I am going to assume that the following doesn't change either array:

all_features = np.array(all_features)
all_labels = np.array(all_labels)
all_features.sort()
all_labels.sort()

Let's extract the labels in data in the order they are stored in the dictionary, and see where in all_labels each item falls:

labels = np.fromiter(data.keys(), all_labels.dtype, len(data))
label_idx = np.searchsorted(all_labels, labels)
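As a small illustration of the searchsorted step (sample arrays assumed, pretending the dict yielded 'y' before 'x'):

```python
import numpy as np

all_labels = np.array(['x', 'y'])   # already sorted
labels = np.array(['y', 'x'])       # order the dict happened to yield

# searchsorted binary-searches each label into the sorted array,
# giving its eventual row position in the final matrix.
label_idx = np.searchsorted(all_labels, labels)
print(label_idx)  # [1 0]
```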

Now let's count how many features each label has, and compute from it the number of non-zero items there will be in your sparse array:

label_features = np.fromiter((len(c) for c in data.values()), np.intp,
                             len(data))
indptr = np.concatenate(([0], np.cumsum(label_features)))
nnz = indptr[-1]

Now, we extract the features for each label, and their corresponding counts

import itertools
features_it = itertools.chain.from_iterable(c.keys() for c in data.values())
features = np.fromiter(features_it, all_features.dtype, nnz)
feature_idx = np.searchsorted(all_features, features)
counts_it = itertools.chain.from_iterable(c.values() for c in data.values())
counts = np.fromiter(counts_it, np.intp, nnz)

With what we have, we can create a CSR matrix directly, with labels as rows and features as columns:

from scipy.sparse import csr_matrix

sps_data = csr_matrix((counts, feature_idx, indptr),
                      shape=(len(all_labels), len(all_features)))
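For intuition, here is the same (data, indices, indptr) format hand-built for the toy data from the question, assuming the dict yielded 'x' then 'y':

```python
import numpy as np
from scipy.sparse import csr_matrix

# Row i of the matrix occupies the slice indptr[i]:indptr[i+1]
# of both the values array and the column-index array.
counts = np.array([1, 45, 1, 212])
feature_idx = np.array([0, 1, 1, 2])   # columns of a, b, b, c
indptr = np.array([0, 2, 4])           # row 0 -> entries 0:2, row 1 -> entries 2:4
m = csr_matrix((counts, feature_idx, indptr), shape=(2, 3))
print(m.toarray())
```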

The only issue is that the rows of this sparse array are not in the order of all_labels, but in the order they came up when iterating over data. But we have label_idx telling us where each label ended up, and we can rearrange the rows by doing:

sps_data = sps_data[np.argsort(label_idx)]
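The reordering trick works because indexing a sparse matrix with an integer array permutes its rows; a minimal sketch with made-up rows:

```python
import numpy as np
from scipy.sparse import csr_matrix

m = csr_matrix(np.array([[0, 1, 212],    # row for 'y' was stored first
                         [1, 45, 0]]))   # row for 'x' was stored second
label_idx = np.array([1, 0])             # where each stored row belongs
reordered = m[np.argsort(label_idx)]     # argsort([1, 0]) == [1, 0]
print(reordered.toarray())
```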

Yes, it is messy, confusing, and probably not very fast, but it works, and it will be much more memory efficient than what you proposed in your question:

>>> sps_data.A
array([[  1,  45,   0],
       [  0,   1, 212]], dtype=int64)
>>> all_labels
array(['x', 'y'], 
      dtype='<S1')
>>> all_features
array(['a', 'b', 'c'], 
      dtype='<S1')
Jaime answered Oct 04 '22 16:10

The dataset is quite large, so I don't think it is practical to construct a temporary dense numpy array (if 32-bit integers are used, a 1e5 x 5e6 matrix would require ~2 terabytes of memory).
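The back-of-the-envelope arithmetic behind that figure:

```python
n_labels, n_features = 10**5, 5 * 10**6
bytes_per_int32 = 4

# A dense n_labels x n_features array of 32-bit ints:
dense_bytes = n_labels * n_features * bytes_per_int32
print(dense_bytes)  # 2000000000000, i.e. ~2 TB
```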

I assume you know an upper bound for the number of features (columns).

The code could look like this:

import scipy.sparse

n_rows = len(data)
max_col = int(5e6)
temp_sparse = scipy.sparse.lil_matrix((n_rows, max_col), dtype='int')

for i, (label, counts) in enumerate(data.items()):
    for feature, n in counts.items():
        j = feature_pos[feature]
        temp_sparse[i, j] = n
result = temp_sparse.tocsc()

Where feature_pos maps each feature to its column index. If it turns out to be impractical to use a dictionary for storing the indices of 5 million features, a hard-drive database should do. The dictionary could be created online, so previous knowledge of all the features is not necessary.

Iterating through 100,000 labels would take a reasonable time, so I think this solution could work if the dataset is sparse enough. Good luck!
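A runnable sketch of this approach on the question's toy data. The feature_pos and label_pos dicts here are built up front from all_features and all_labels for a deterministic layout; as noted above, they could instead be grown on the fly:

```python
from collections import Counter
import scipy.sparse

data = {'x': Counter({'a': 1, 'b': 45}), 'y': Counter({'b': 1, 'c': 212})}
all_labels = ['x', 'y']
all_features = ['a', 'b', 'c']

feature_pos = {f: j for j, f in enumerate(all_features)}
label_pos = {l: i for i, l in enumerate(all_labels)}

# lil_matrix supports cheap incremental writes; convert to CSC at the end.
temp_sparse = scipy.sparse.lil_matrix((len(all_labels), len(all_features)),
                                      dtype='int')
for label, counts in data.items():
    i = label_pos[label]
    for feature, n in counts.items():
        temp_sparse[i, feature_pos[feature]] = n

result = temp_sparse.tocsc()
print(result.toarray())  # [[  1  45   0]
                         #  [  0   1 212]]
```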

ajeje answered Oct 04 '22 17:10