
numpy.memmap for an array of strings?

Is it possible to use numpy.memmap to map a large disk-based array of strings into memory?

I know it can be done for floats and suchlike, but this question is specifically about strings.

I am interested in solutions for both fixed-length and variable-length strings.

The solution is free to dictate any reasonable file format.

NPE asked May 05 '11 11:05





1 Answer

If all the strings have the same length, as suggested by the term "array", this is easily possible:

import numpy
a = numpy.memmap("data", dtype="S10")

would be an example for strings of length 10.
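
For a fuller picture of the fixed-length case, here is a minimal sketch that writes a small file and maps it back. The temporary path and the sample words are made up for illustration; the only thing taken from the answer is the "S10" dtype, i.e. one 10-byte record per string.

```python
import os
import tempfile

import numpy as np

# Hypothetical file location and sample data, used only for illustration.
path = os.path.join(tempfile.mkdtemp(), "data")
words = np.array([b"alpha", b"beta", b"gamma"], dtype="S10")

# Each record occupies exactly 10 bytes; shorter strings are NUL-padded.
words.tofile(path)

# Map the file read-only; the number of records is inferred from the file size.
a = np.memmap(path, dtype="S10", mode="r")

print(a.shape)  # one entry per 10-byte record
print(a[1])     # a single record; trailing NUL padding is stripped on access
```

Note that strings longer than 10 bytes would be silently truncated when stored with this dtype, so the record width has to be chosen to fit the longest string.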

Edit: Since apparently the strings don't have the same length, you need to index the file to allow for O(1) item access. This requires reading the whole file once and storing the start indices of all strings in memory. Unfortunately, I don't think there is a pure NumPy way of indexing without creating an array the same size as the file in memory first. This array can be dropped after extracting the indices, though.
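
One way this indexing could look is sketched below. It assumes a file format the answer does not specify: every string is terminated by a newline byte and contains no newline itself. The file name, the sample data and the helper get_string are all hypothetical. As noted above, the comparison step does build a temporary boolean array as large as the file.

```python
import os
import tempfile

import numpy as np

# Hypothetical file: newline-terminated byte strings of varying length.
path = os.path.join(tempfile.mkdtemp(), "strings.txt")
with open(path, "wb") as f:
    f.write(b"a\nbb\nccc\n")

# Map the raw bytes; nothing is read until the data is touched.
raw = np.memmap(path, dtype=np.uint8, mode="r")

# One pass over the file: positions of the terminators.
# The comparison creates a temporary boolean array the size of the file,
# which can be discarded once the indices have been extracted.
ends = np.flatnonzero(raw == ord("\n"))
starts = np.concatenate(([0], ends[:-1] + 1))


def get_string(i):
    """Return the i-th string as bytes, using the in-memory index (O(1) lookup)."""
    return raw[starts[i]:ends[i]].tobytes()


print(get_string(1))
```

Only starts and ends stay in memory afterwards, so the cost is two integers per string rather than the string data itself. For very large files the scan could be done in chunks to avoid the file-sized temporary, at the price of leaving pure NumPy one-liners behind.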

Sven Marnach answered Sep 19 '22 00:09
