
How to mmap a 2d array from a text file

Tags: python, numpy, mmap

I have very large files containing 2d arrays of positive integers.

  • Each file contains a matrix

I would like to process them without reading the files into memory. Luckily I only need to look at the values from left to right in the input file. I was hoping to mmap each file so I can process it as if it were in memory, but without actually reading the file into memory.

Example of smaller version:

[[2, 2, 6, 10, 2, 6, 7, 15, 14, 10, 17, 14, 7, 14, 15, 7, 17], 
 [3, 3, 7, 11, 3, 7, 0, 11, 7, 16, 0, 17, 17, 7, 16, 0, 0], 
 [4, 4, 8, 7, 4, 13, 0, 0, 15, 7, 8, 7, 0, 7, 0, 15, 13], 
 [5, 5, 9, 12, 5, 14, 7, 13, 9, 14, 16, 12, 13, 14, 7, 16, 7]]

Is it possible to mmap such a file so I can then process the np.int64 values with

for i in range(rownumber):
    for j in range(rowlength):
        process(M[i, j])

To be clear, I never want to have the whole input file in memory, as it won't fit.

asked Dec 17 '22 by graffe


1 Answer

Updated Answer

On the basis of your comments and clarifications, it appears you actually have a text file with a bunch of square brackets in it, around 4 lines long, with 1,000,000,000 ASCII integers per line separated by commas. Not a very efficient format! I would suggest you simply pre-process the file to remove all square brackets, linefeeds and spaces, and convert the commas to newlines, so that you get one value per line, which you can easily deal with.

Using the tr command to transliterate, that would be this:

# Delete all square brackets, newlines and spaces, change commas into newlines
tr -d '[] \n' < YourFile.txt | tr , '\n' > preprocessed.txt

Your file then looks like this and you can readily process one value at a time in Python.

2
2
6
10
2
6
...
...

In case you are on Windows, the tr tool is available via GNUWin32, Git Bash, or the Windows Subsystem for Linux.
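If tr is not available, the same preprocessing can be sketched in pure Python, reading the file in fixed-size chunks so the whole file is never loaded (the chunk size here is an arbitrary choice):

```python
def preprocess(src, dst, chunk_size=1 << 20):
    # Mirror the tr pipeline: delete brackets, spaces and newlines,
    # then turn commas into newlines, one chunk at a time
    with open(src) as ifile, open(dst, 'w') as ofile:
        while True:
            chunk = ifile.read(chunk_size)
            if not chunk:
                break
            for ch in '[] \n':
                chunk = chunk.replace(ch, '')
            ofile.write(chunk.replace(',', '\n'))
```

Because no separator is inserted between chunks, a number split across a chunk boundary is written back contiguously, so the output matches the tr version.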

You can go a step further and make a binary file that you can memmap like in the second part of my answer; you can then seek directly to any value in the file. So, taking the preprocessed.txt created above, you can make a binary version like this:

import struct

# Make binary memmapable version
with open('preprocessed.txt', 'r') as ifile, open('preprocessed.bin', 'wb') as ofile:
    for line in ifile:
        ofile.write(struct.pack('q', int(line)))
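Once the binary file exists, it can be memory-mapped as a flat array of int64 values, and reshaped (without copying) if you know the row length. A sketch, where the helper name and the optional shape arguments are mine for illustration:

```python
import numpy as np

def open_matrix(path, rows=None, cols=None):
    # Memory-map the binary file as int64; the OS pages data in on
    # demand, so the whole file is never read into RAM
    mm = np.memmap(path, dtype=np.int64, mode='r')
    if rows is not None and cols is not None:
        # reshape returns a view, not a copy
        mm = mm.reshape(rows, cols)
    return mm
```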

Original Answer

You can do that like this. The first part is just setup:

#!/usr/bin/env python3

import numpy as np

# Create a 2x4 Numpy array of int64
a = np.arange(8, dtype=np.int64).reshape(2,4)

# Write to file as binary
a.tofile('a.dat')

Now check the file by hex-dumping it in the shell:

xxd a.dat

00000000: 0000 0000 0000 0000 0100 0000 0000 0000  ................
00000010: 0200 0000 0000 0000 0300 0000 0000 0000  ................
00000020: 0400 0000 0000 0000 0500 0000 0000 0000  ................
00000030: 0600 0000 0000 0000 0700 0000 0000 0000  ................

Now that we are all set up, let's memmap() the file:

# Memmap file and access values via 'mm'
mm = np.memmap('a.dat', dtype=np.int64, mode='r', shape=(2,4))

print(mm[1,2])      # prints 6
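Tying this back to the loop in the question, the memmapped array can be walked value by value left to right; only the pages actually touched become resident (process here stands in for your own function):

```python
def process_all(M, process):
    # Visit every value left to right within each row, as the question requires
    rows, cols = M.shape
    for i in range(rows):
        for j in range(cols):
            process(M[i, j])
```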
answered Jan 03 '23 by Mark Setchell