Is it possible to use numpy.memmap
to map a large disk-based array of strings into memory?
I know it can be done for floats and suchlike, but this question is specifically about strings.
I am interested in solutions for both fixed-length and variable-length strings.
The solution is free to dictate any reasonable file format.
From the NumPy documentation for memmap: "Create a memory-map to an array stored in a binary file on disk. Memory-mapped files are used for accessing small segments of large files on disk, without reading the entire file into memory."
The elements of a NumPy array, or simply an array, are usually numbers, but can also be booleans, strings, or other objects.
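As a baseline, the numeric case the question already mentions might look like this (a minimal sketch; the file name "floats.bin" and the element count are made up):

import numpy

# Hypothetical setup: write one million float64 values to "floats.bin".
numpy.arange(1_000_000, dtype=numpy.float64).tofile("floats.bin")

# Map the file; only the segments actually accessed are read into memory.
f = numpy.memmap("floats.bin", dtype=numpy.float64, mode="r")
print(f[123456])  # 123456.0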
If all the strings have the same length, as suggested by the term "array", this is easily possible:
import numpy
a = numpy.memmap("data", dtype="S10", mode="r")

would be an example for strings of length 10 (mode="r" opens the existing file read-only).
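As a fuller sketch, here is a round trip that first writes fixed-width records and then maps them back; the file name "data", the record count, and the contents are made up:

import numpy

# Hypothetical setup: write 1000 fixed-width (10-byte) strings to "data".
records = numpy.array([b"item%05d!" % i for i in range(1000)], dtype="S10")
records.tofile("data")

# Map the file back; only the records actually accessed are read from disk.
a = numpy.memmap("data", dtype="S10", mode="r")
print(a[42])  # b'item00042!'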
Edit: Since apparently the strings don't have the same length, you need to index the file to allow for O(1) item access. This requires reading the whole file once and storing the start indices of all strings in memory. Unfortunately, I don't think there is a pure NumPy way of indexing without creating an array the same size as the file in memory first. This array can be dropped after extracting the indices, though.
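For example, with newline-separated strings, the index can be built in one pass over a byte-level memmap. This is a sketch under assumptions: the file name "data.txt", the trailing newline, and the UTF-8 encoding are all made up, and the separator could be any byte that never occurs inside a string. Note that the boolean mask created by the comparison is the full-file-sized temporary array mentioned above; it can be dropped once the offsets are extracted.

import numpy

# Assumption: variable-length strings stored one per line in "data.txt",
# with the file ending in a newline.
raw = numpy.memmap("data.txt", dtype=numpy.uint8, mode="r")

# One pass over the file: the comparison allocates a full-size boolean mask,
# which is discarded once the newline positions have been extracted.
newlines = numpy.flatnonzero(raw == ord("\n"))

# Start and end offsets of each string.
starts = numpy.concatenate(([0], newlines[:-1] + 1))
ends = newlines

def get_string(i):
    # O(1) item access: slice the memmap between the precomputed offsets.
    return bytes(raw[starts[i]:ends[i]]).decode("utf-8")

print(get_string(2))  # third string in the file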