Speed of writing a numpy array to a text file

I need to write a very tall two-column array (many rows, few columns) to a text file, and it is very slow. I find that if I reshape the array into a wider one, writing is much quicker. For example:

import time
import numpy as np
dataMat1 = np.random.rand(1000,1000)   # square
dataMat2 = np.random.rand(2,500000)    # wide: few rows, many columns
dataMat3 = np.random.rand(500000,2)    # tall: many rows, few columns
start = time.perf_counter()
with open('test1.txt','w') as f:
    np.savetxt(f,dataMat1,fmt='%g',delimiter=' ')
end = time.perf_counter()
print(end-start)

start = time.perf_counter()
with open('test2.txt','w') as f:
    np.savetxt(f,dataMat2,fmt='%g',delimiter=' ')
end = time.perf_counter()
print(end-start)

start = time.perf_counter()
with open('test3.txt','w') as f:
    np.savetxt(f,dataMat3,fmt='%g',delimiter=' ')
end = time.perf_counter()
print(end-start)

With the same number of elements in the three data matrices, why is the last one so much more time-consuming than the other two? Is there any way to speed up writing a tall data array?

asked Dec 17 '18 by Sheldon



2 Answers

As hpaulj pointed out, savetxt is looping through the rows of X and formatting each row individually:

for row in X:
    try:
        v = format % tuple(row) + newline
    except TypeError:
        raise TypeError("Mismatch between array dtype ('%s') and "
                        "format specifier ('%s')"
                        % (str(X.dtype), format))
    fh.write(v)

I think the main time-killer here is all the string interpolation calls. If we pack all the string interpolation into one call, things go much faster:

with open('/tmp/test4.txt','w') as f:
    fmt = ' '.join(['%g']*dataMat3.shape[1])
    fmt = '\n'.join([fmt]*dataMat3.shape[0])
    data = fmt % tuple(dataMat3.ravel())
    f.write(data)

import time
import numpy as np

dataMat1 = np.random.rand(1000,1000)
dataMat2 = np.random.rand(2,500000)
dataMat3 = np.random.rand(500000,2)
start = time.perf_counter()
with open('/tmp/test1.txt','w') as f:
    np.savetxt(f,dataMat1,fmt='%g',delimiter=' ')
end = time.perf_counter()
print(end-start)

start = time.perf_counter()
with open('/tmp/test2.txt','w') as f:
    np.savetxt(f,dataMat2,fmt='%g',delimiter=' ')
end = time.perf_counter()
print(end-start)

start = time.perf_counter()
with open('/tmp/test3.txt','w') as f:
    np.savetxt(f,dataMat3,fmt='%g',delimiter=' ')
end = time.perf_counter()
print(end-start)

start = time.perf_counter()
with open('/tmp/test4.txt','w') as f:
    fmt = ' '.join(['%g']*dataMat3.shape[1])
    fmt = '\n'.join([fmt]*dataMat3.shape[0])
    data = fmt % tuple(dataMat3.ravel())        
    f.write(data)
end = time.perf_counter()
print(end-start)

On my machine, this reports:

0.1604848340011813   (test1: 1000 x 1000, savetxt)
0.17416274400056864  (test2: 2 x 500000, savetxt)
0.6634929459996783   (test3: 500000 x 2, savetxt)
0.16207673999997496  (test4: 500000 x 2, single formatted write)
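One caveat with the single-call approach: it builds the entire output as one giant string in memory. A middle ground (my own sketch, not part of the answer above; the function name is made up) is to format a fixed number of rows per interpolation call:

```python
import io
import numpy as np

def savetxt_chunked(f, arr, fmt='%g', chunk=10000):
    """Write a 2-D array as text, formatting `chunk` rows per
    %-interpolation call -- amortizes the per-call overhead that makes
    savetxt slow, without building one huge string for the whole array."""
    row_fmt = ' '.join([fmt] * arr.shape[1])
    for start in range(0, arr.shape[0], chunk):
        block = arr[start:start + chunk]
        block_fmt = '\n'.join([row_fmt] * block.shape[0])
        f.write(block_fmt % tuple(block.ravel()))
        f.write('\n')

dataMat3 = np.random.rand(500000, 2)
with open('/tmp/test5.txt', 'w') as f:
    savetxt_chunked(f, dataMat3)
```

In my quick tests this stays close to the single-write speed while keeping peak memory bounded by the chunk size rather than the array size.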
answered Oct 31 '22 by unutbu

The code for savetxt is pure Python and easy to inspect. Basically it does a formatted write for each row/line. In effect it does

for row in arr:
    f.write(fmt % tuple(row))

where fmt is derived from your fmt argument and the shape of the array, e.g.

'%g %g %g ...'

So it does one file write per row of the array. Formatting each line takes time as well, and that is all done in memory with Python-level code.
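To make that concrete, here is a small sketch (mine, using a hypothetical 4x3 array) of what savetxt effectively does row by row:

```python
import numpy as np

arr = np.random.rand(4, 3)
fmt = ' '.join(['%g'] * arr.shape[1])        # -> '%g %g %g'
lines = [fmt % tuple(row) for row in arr]    # one interpolation per row
text = '\n'.join(lines)
print(text)                                  # 4 lines, 3 numbers each
```

For a 500000-row array that means 500000 separate interpolation calls and writes, which is where the time goes.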

I expect loadtxt/genfromtxt will show the same time pattern - it takes longer to read many rows.

pandas has a faster csv load. I haven't seen any discussion of its write speed.
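For reference, the pandas write path would look like this (a hedged sketch; as noted above, I have not seen benchmarks of its write speed):

```python
import numpy as np
import pandas as pd

dataMat3 = np.random.rand(500000, 2)
# DataFrame.to_csv writes the whole frame in one call;
# float_format plays the role of savetxt's fmt='%g'
pd.DataFrame(dataMat3).to_csv('/tmp/test_pandas.txt', sep=' ',
                              header=False, index=False,
                              float_format='%g')
```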

answered Oct 31 '22 by hpaulj