I have very large files containing 2d arrays of positive integers
I would like to process them without reading the files into memory. Luckily, I only need to look at the values from left to right in each input file. I was hoping to mmap each file so I can process it as if it were in memory, without actually reading the whole file into memory.
Example of smaller version:
[[2, 2, 6, 10, 2, 6, 7, 15, 14, 10, 17, 14, 7, 14, 15, 7, 17],
[3, 3, 7, 11, 3, 7, 0, 11, 7, 16, 0, 17, 17, 7, 16, 0, 0],
[4, 4, 8, 7, 4, 13, 0, 0, 15, 7, 8, 7, 0, 7, 0, 15, 13],
[5, 5, 9, 12, 5, 14, 7, 13, 9, 14, 16, 12, 13, 14, 7, 16, 7]]
Is it possible to mmap such a file so I can then process the np.int64 values with
for i in range(rownumber):
    for j in range(rowlength):
        process(M[i, j])
To be clear, I don't want ever to have all my input file in memory as it won't fit.
Updated Answer
On the basis of your comments and clarifications, it appears you actually have a text file with a bunch of square brackets in it that is around 4 lines long with 1,000,000,000 ASCII integers per line separated by commas. Not a very efficient format! I would suggest you simply pre-process the file to remove all square brackets, linefeeds, and spaces and convert the commas to newlines so that you get one value per line which you can easily deal with.
Using the tr
command to transliterate, that would be this:
# Delete all square brackets, newlines and spaces, change commas into newlines
tr -d '[] \n' < YourFile.txt | tr , '\n' > preprocessed.txt
Your file then looks like this and you can readily process one value at a time in Python.
2
2
6
10
2
6
...
...
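As a sketch of what "one value at a time" could look like, here is a small generator that streams the preprocessed file without ever holding more than one line in memory. The function name stream_values and the filename preprocessed.txt are just illustrative; the file is assumed to have one integer per line, as produced by the tr pipeline above.

```python
def stream_values(path):
    """Yield one int at a time from a file with one integer per line.

    Only a single line is ever held in memory."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:                 # skip any trailing blank line
                yield int(line)
```

You could then write for v in stream_values('preprocessed.txt'): process(v) and never have more than one value in memory at once.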
In case you are on Windows, the tr
tool is available in GNUWin32
, in the Windows Subsystem for Linux, and in Git Bash.
You can go still further and make a binary file that you can memmap()
as in the original answer below; you could then seek directly to any value in the file. So, taking the preprocessed.txt
created above, you can make a binary version like this:
import struct
# Make binary memmapable version
with open('preprocessed.txt', 'r') as ifile, open('preprocessed.bin', 'wb') as ofile:
    for line in ifile:
        ofile.write(struct.pack('q', int(line)))
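To map that binary file back, something like the helper below should work. The name map_int64 is just illustrative, and it assumes the file was written with struct.pack('q', ...) in native byte order on the same machine, so each value occupies 8 bytes.

```python
import os

import numpy as np

def map_int64(path, rows=None):
    """Memory-map a flat file of native-endian int64 values.

    Nothing is read into memory until elements are accessed;
    the OS pages data in on demand."""
    n = os.path.getsize(path) // 8            # 8 bytes per int64
    mm = np.memmap(path, dtype=np.int64, mode='r', shape=(n,))
    if rows is not None:
        mm = mm.reshape(rows, n // rows)      # 2-D view, still memory-mapped
    return mm
```

With your 4-row example file, map_int64('preprocessed.bin', rows=4) would give you the M[i, j] indexing from your question.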
Original Answer
You can do that like this. The first part is just setup:
#!/usr/bin/env python3
import numpy as np
# Create 2x4 NumPy array of int64
a = np.arange(8, dtype=np.int64).reshape(2,4)
# Write to file as binary
a.tofile('a.dat')
Now check the file by hex-dumping it in the shell:
xxd a.dat
00000000: 0000 0000 0000 0000 0100 0000 0000 0000 ................
00000010: 0200 0000 0000 0000 0300 0000 0000 0000 ................
00000020: 0400 0000 0000 0000 0500 0000 0000 0000 ................
00000030: 0600 0000 0000 0000 0700 0000 0000 0000 ................
Now that we are all set up, let's memmap()
the file:
# Memmap file and access values via 'mm'
mm = np.memmap('a.dat', dtype=np.int64, mode='r', shape=(2,4))
print(mm[1,2]) # prints 6
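With the memmap in place, the left-to-right loop from your question works unchanged; a sketch, where process is whatever function you apply to each value:

```python
def process_all(mm, process):
    """Visit every value left to right, row by row.

    Only the pages actually touched are read in by the OS,
    so the whole file is never resident in memory at once."""
    rows, cols = mm.shape
    for i in range(rows):
        for j in range(cols):
            process(mm[i, j])
```

Because access is strictly sequential, the OS can also discard already-processed pages behind you, keeping resident memory small.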