I have a huge file (around 30 GB); each line contains the coordinates of a point on a 2D surface. I need to load the file into a NumPy array, points = np.empty((0, 2)), and apply scipy.spatial.ConvexHull to it. Since the file is too large to load into memory at once, I want to read it in batches of N lines, apply scipy.spatial.ConvexHull to that small part, and then load the next N rows. What's an efficient way to do this?
I found out that in Python you can use islice to read N lines of a file, but the problem is that lines_gen is a generator object, which yields the file one line at a time and is meant to be consumed in a loop, so I am not sure how to convert lines_gen into a NumPy array efficiently:
from itertools import islice
with open(input, 'r') as infile:
    lines_gen = islice(infile, N)
My input file:
0.989703 1
0 0
0.0102975 0
0.0102975 0
1 1
0.989703 1
1 1
0 0
0.0102975 0
0.989703 1
0.979405 1
0 0
0.020595 0
0.020595 0
1 1
0.979405 1
1 1
0 0
0.020595 0
0.979405 1
0.969108 1
...
...
...
0 0
0.0308924 0
0.0308924 0
1 1
0.969108 1
1 1
0 0
0.0308924 0
0.969108 1
0.95881 1
0 0
With your data, I can read it in 5-line chunks (N = 5) like this:
In [182]: from itertools import islice
   .....: with open(input, 'r') as infile:
   .....:     while True:
   .....:         gen = islice(infile, N)
   .....:         arr = np.genfromtxt(gen, dtype=None)
   .....:         print arr
   .....:         if arr.shape[0] < N:
   .....:             break
   .....:
[(0.989703, 1) (0.0, 0) (0.0102975, 0) (0.0102975, 0) (1.0, 1)]
[(0.989703, 1) (1.0, 1) (0.0, 0) (0.0102975, 0) (0.989703, 1)]
[(0.979405, 1) (0.0, 0) (0.020595, 0) (0.020595, 0) (1.0, 1)]
[(0.979405, 1) (1.0, 1) (0.0, 0) (0.020595, 0) (0.979405, 1)]
[(0.969108, 1) (0.0, 0) (0.0308924, 0) (0.0308924, 0) (1.0, 1)]
[(0.969108, 1) (1.0, 1) (0.0, 0) (0.0308924, 0) (0.969108, 1)]
[(0.95881, 1) (0.0, 0)]
The same thing read as one chunk is:
In [183]: with open(input, 'r') as infile:
   .....:     arr = np.genfromtxt(infile, dtype=None)
   .....:
In [184]: arr
Out[184]:
array([(0.989703, 1), (0.0, 0), (0.0102975, 0), (0.0102975, 0), (1.0, 1),
(0.989703, 1), (1.0, 1), (0.0, 0), (0.0102975, 0), (0.989703, 1),
(0.979405, 1), (0.0, 0), (0.020595, 0), (0.020595, 0), (1.0, 1),
(0.979405, 1), (1.0, 1), (0.0, 0), (0.020595, 0), (0.979405, 1),
(0.969108, 1), (0.0, 0), (0.0308924, 0), (0.0308924, 0), (1.0, 1),
(0.969108, 1), (1.0, 1), (0.0, 0), (0.0308924, 0), (0.969108, 1),
(0.95881, 1), (0.0, 0)],
dtype=[('f0', '<f8'), ('f1', '<i4')])
(This is in Python 2.7; in Python 3 there's a bytes/string issue I'd need to work around.)
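If the end goal is one hull for the entire file, another option is to exploit the fact that the convex hull of a point set is determined entirely by the hull vertices of its parts: hull each chunk together with the vertices kept so far, and carry only those vertices forward. Below is a rough sketch along those lines (filename and N are placeholders; dtype=float gives a plain (n, 2) float array instead of the structured array shown above):

import numpy as np
from itertools import islice
from scipy.spatial import ConvexHull

filename = 'points.txt'   # placeholder for the 30 GB file
N = 1000000               # lines per chunk; tune to the memory you can spare

candidates = np.empty((0, 2))   # hull vertices collected from the chunks seen so far

with open(filename, 'r') as infile:
    while True:
        chunk = list(islice(infile, N))        # at most N lines; empty list at EOF
        if not chunk:
            break
        arr = np.atleast_2d(np.genfromtxt(chunk, dtype=float))
        # hull of (previous vertices + new points); points strictly inside the
        # current hull can never become hull vertices later, so drop them here
        pts = np.vstack((candidates, arr))
        hull = ConvexHull(pts)
        candidates = pts[hull.vertices]

print(candidates)   # vertices of the convex hull of the whole file

scipy.spatial.ConvexHull also has an incremental=True mode with an add_points method, but as far as I know it keeps every added point in hull.points, so for a file this size the vertex-carrying approach above should be much lighter on memory.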
You could try the second method from this post: read the file in chunks, looking up where a given line starts via a pre-computed array of line offsets, provided that offset array fits into memory. Here is an example of what I typically use to avoid loading whole files into memory:
data_file = open("data_file.txt", "rb")

# build an index: the byte offset at which each line starts
line_offset = []
offset = 0
while 1:
    lines = data_file.readlines(100000)   # read roughly 100 kB worth of lines at a time
    if not lines:
        break
    for line in lines:
        line_offset.append(offset)
        offset += len(line)

# reading a line
line_to_read = 1
line = ''
data_file.seek(line_offset[line_to_read])
line = data_file.readline()
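To tie this back to the question, here is a minimal sketch (read_chunk is a hypothetical helper, not part of the recipe above) of using the line_offset index to pull an arbitrary batch of N lines into a NumPy array:

import numpy as np
from itertools import islice

def read_chunk(data_file, line_offset, start, n):
    """Read up to n lines, starting at line `start`, into an (m, 2) float array."""
    data_file.seek(line_offset[start])                        # jump straight to line `start`
    lines = [line.decode() for line in islice(data_file, n)]  # the file was opened in "rb"
    return np.atleast_2d(np.genfromtxt(lines, dtype=float))

points = read_chunk(data_file, line_offset, 0, 5)   # the first 5 points of the file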