I'm currently doing some memory-intensive text processing, for which I have to construct a sparse matrix of float32s with dimensions of roughly (2M, 5M). I'm constructing this matrix column by column while reading a corpus of 5M documents. For this purpose I use a sparse dok_matrix data structure from SciPy. However, on arriving at the 500,000th document, my memory is full (approx. 30GB used) and the program crashes. What I eventually want to do is run a dimensionality reduction algorithm on the matrix using sklearn, but, as said, it is impossible to hold and construct the entire matrix in memory. I've looked into numpy.memmap, since sklearn supports it, and tried to memmap some of the underlying numpy data structures of the SciPy sparse matrix, but I could not get this to work.
It is impossible for me to save the entire matrix in a dense format, since this would require 40TB of disk space. So I think that HDF5 and PyTables are not an option for me (?).
My question now is: how can I construct a sparse matrix on the fly, writing directly to disk instead of memory, such that I can use it afterwards in sklearn?
Thanks!
We've come across similar problems in single-cell genomics, dealing with large sparse datasets on disk. I'll show you a small, simple example of how I would deal with this. My assumption is that you're very memory constrained, and probably can't fit multiple copies of the sparse matrix into memory at once. This will work even if you can't fit one entire copy.
I would construct an on-disk sparse CSC matrix column by column. A sparse CSC matrix uses three underlying arrays:

- data: the values stored in the matrix
- indices: the row index for each value in the matrix
- indptr: an array of length n_cols + 1, which divides indices and data by which column they belong to

As an explanatory example, the values for column i are stored in the range indptr[i]:indptr[i+1] of data. Similarly, the row indices for these values can be found with indices[indptr[i]:indptr[i+1]].
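For instance, here is a tiny concrete example (not part of the solution, just to illustrate the layout of the three arrays):

import numpy as np
from scipy import sparse

# A small 3x3 matrix: two non-zero values in column 0, one in column 2
m = sparse.csc_matrix(np.array([
    [1., 0., 0.],
    [4., 0., 2.],
    [0., 0., 0.],
], dtype=np.float32))

print(m.data)     # [1. 4. 2.]   the stored values, column by column
print(m.indices)  # [0 1 1]      row index of each value
print(m.indptr)   # [0 2 2 3]    column boundaries (length n_cols + 1)

i = 0
print(m.data[m.indptr[i]:m.indptr[i+1]])     # values in column 0 -> [1. 4.]
print(m.indices[m.indptr[i]:m.indptr[i+1]])  # their row indices  -> [0 1]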
To simulate your data generating process (parsing a document, I assume), I'll define a function process_document which returns the values for indices and data for the relevant document.
import numpy as np
import h5py
from scipy import sparse
from tqdm import tqdm  # For monitoring the writing process
from typing import Tuple, Union  # Just for argument annotation


def process_document():
    """
    Simulate processing a document. Results in a sparse vector representation.
    """
    n_items = np.random.negative_binomial(2, .0001)
    indices = np.random.choice(2_000_000, n_items, replace=False)
    indices.sort()
    data = np.random.random(n_items).astype(np.float32)
    return indices, data


def data_generator(n):
    """Iterator which yields simulated data."""
    for i in range(n):
        yield process_document()
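As a quick sanity check, a single simulated document looks like this (just a usage example of the functions above):

indices, data = process_document()
print(indices[:5], data[:5])       # sorted row indices and their float32 values
print(len(indices) == len(data))   # True: one value per row index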
Now I'll create a group in an HDF5 file which will store the constituent arrays of a sparse matrix.
def make_sparse_csc_group(f: Union[h5py.File, h5py.Group], groupname: str, shape: Tuple[int, int]):
    """
    Create a group in an hdf5 file that can store a CSC sparse matrix.
    """
    g = f.create_group(groupname)
    g.attrs["shape"] = shape
    g.create_dataset("indices", shape=(1,), dtype=np.int64, chunks=True, maxshape=(None,))
    g["indptr"] = np.zeros(shape[1] + 1, dtype=int)  # We want this to have a zero for the first value
    g.create_dataset("data", shape=(1,), dtype=np.float32, chunks=True, maxshape=(None,))
    return g
And finally a function for reading this group as a sparse matrix (this one is pretty simple).
def read_sparse_csc_group(g: Union[h5py.File, h5py.Group]):
    return sparse.csc_matrix((g["data"], g["indices"], g["indptr"]), shape=g.attrs["shape"])
Now we'll create the on-disk sparse matrix and write one column at a time to it (I'm using fewer columns since this can be kinda slow).
N_COLS = 10


def make_disk_matrix(f, groupname, data_iter, shape):
    group = make_sparse_csc_group(f, groupname, shape)

    indptr = group["indptr"]
    data = group["data"]
    indices = group["indices"]
    n_total = 0

    for doc_num, (cur_indices, cur_data) in enumerate(tqdm(data_iter)):
        n_cur = len(cur_indices)
        n_prev = n_total
        n_total += n_cur
        # Grow the on-disk datasets, then append this column's values
        indices.resize((n_total,))
        data.resize((n_total,))
        indices[n_prev:] = cur_indices
        data[n_prev:] = cur_data
        indptr[doc_num + 1] = n_total


# Writing
with h5py.File("data.h5", "w") as f:
    make_disk_matrix(f, "mtx", data_generator(N_COLS), (2_000_000, N_COLS))

# Reading
with h5py.File("data.h5", "r") as f:
    mtx = read_sparse_csc_group(f["mtx"])
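Once read back, mtx is a regular scipy.sparse.csc_matrix, so you can hand it to sklearn directly. A minimal sketch, assuming you want something like truncated SVD for the dimensionality reduction (swap in whatever estimator you actually use; note that reading the group this way pulls the three arrays into memory):

from sklearn.decomposition import TruncatedSVD

with h5py.File("data.h5", "r") as f:
    mtx = read_sparse_csc_group(f["mtx"])
    # TruncatedSVD accepts scipy sparse input without densifying it
    svd = TruncatedSVD(n_components=5)
    reduced = svd.fit_transform(mtx.T)  # documents as rows: shape (n_docs, n_components)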
Again, this is for a very memory-constrained situation, where you might not be able to fit the entire sparse matrix in memory while creating it. If you can handle the entire sparse matrix plus at least one copy, it would be much faster to skip the on-disk storage (similar to other suggestions). However, a slight modification of this code should still give you better performance:
def make_memory_mtx(data_iter, shape):
    indices_list = []
    data_list = []
    indptr = np.zeros(shape[1] + 1, dtype=int)
    n_total = 0

    for doc_num, (cur_indices, cur_data) in enumerate(data_iter):
        n_cur = len(cur_indices)
        n_prev = n_total
        n_total += n_cur
        indices_list.append(cur_indices)
        data_list.append(cur_data)
        indptr[doc_num + 1] = n_total

    # Concatenate once at the end instead of repeatedly reallocating
    indices = np.concatenate(indices_list)
    data = np.concatenate(data_list)

    return sparse.csc_matrix((data, indices, indptr), shape=shape)


mtx = make_memory_mtx(data_generator(10), shape=(2_000_000, 10))
This should be fairly fast, since it only copies the data once, when the arrays are concatenated. The other currently posted solutions reallocate the arrays as you process, making many copies of large arrays.