numpy.loadtxt is way slower than open(...).readlines()

Tags: python, numpy

When comparing these two ways of doing the same thing:

import numpy as np
import time

start_time = time.time()
for j in range(1000):
    bv = np.loadtxt('file%d.dat' % (j + 1))
    if j % 100 == 0:
        print(bv[300, 0])
T1 = time.time() - start_time
print("--- %s seconds ---" % T1)

and

import numpy as np
import time

start_time = time.time()
for j in range(1000):
    a = open('file%d.dat' % (j + 1), 'r')
    b = a.readlines()
    a.close()
    for i in range(len(b)):
        b[i] = b[i].strip("\n")
        b[i] = b[i].split("\t")
        b[i] = list(map(float, b[i]))  # list() so this also works on Python 3
    bv = np.asarray(b)
    if j % 100 == 0:
        print(bv[300, 0])
T1 = time.time() - start_time
print("--- %s seconds ---" % T1)

I have noticed that the second one is much faster. Is there a way to get something as concise as the first method and as fast as the second? Why is loadtxt so slow compared with performing the same task manually?

3sm1r asked Dec 17 '25 15:12


1 Answer

With a simple, not too large csv created with:

In [898]: arr = np.ones((1000,100))
In [899]: np.savetxt('float.csv',arr)

the loadtxt version:

In [900]: timeit data = np.loadtxt('float.csv')
112 ms ± 119 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

fromfile can load text too, though it doesn't preserve any shape info and shows no apparent speed advantage:

In [901]: timeit data = np.fromfile('float.csv', dtype=float, sep=' ').reshape(-1,100)
129 ms ± 1.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
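Since fromfile returns a flat array, the column count has to come from somewhere else. A minimal sketch, assuming a rectangular, whitespace-delimited file with no header or comment lines; fromfile_2d is a hypothetical helper name, not part of NumPy:

```python
import numpy as np


def fromfile_2d(path):
    """Load a whitespace-delimited numeric text file as a 2-D float array.

    Hypothetical helper: assumes every row has the same number of columns
    and that the file contains only numbers (no header, no comments).
    """
    # Peek at the first line only to learn how many columns there are.
    with open(path) as f:
        ncols = len(f.readline().split())
    # In text mode a " " separator matches any run of whitespace,
    # so newlines are treated like ordinary separators.
    flat = np.fromfile(path, dtype=float, sep=" ")
    return flat.reshape(-1, ncols)
```

This only removes the hard-coded 100 from the reshape call; it should not be expected to change the timing shown above.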

the most concise version of the 'manual' approach that I can come up with:

In [902]: %%timeit
     ...: with open('float.csv') as f:
     ...:     data = np.array([line.strip().split() for line in f],float)
52.9 ms ± 589 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

This roughly 2x improvement over loadtxt seems typical of variations on the manual approach.
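To get a call as short as loadtxt inside the question's loop, the manual read can be wrapped in a small function. This is only a sketch: load_floats is a hypothetical name, and it assumes a rectangular file of plain numbers with no header, comments, or missing values (the things loadtxt handles for you).

```python
import numpy as np


def load_floats(path, delimiter=None):
    """Read a rectangular text table of numbers into a 2-D float array.

    Hypothetical helper. delimiter=None splits on any run of whitespace;
    pass "\t" for strictly tab-separated files like those in the question.
    """
    with open(path) as f:
        # strip() drops the trailing newline; blank lines are skipped.
        rows = [line.strip().split(delimiter) for line in f if line.strip()]
    # NumPy converts the numeric strings when an explicit float dtype is given.
    return np.array(rows, dtype=float)
```

The loop body from the question then becomes bv = load_floats('file%d.dat' % (j + 1), delimiter='\t').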

pd.read_csv takes about the same time.

genfromtxt is a bit faster than loadtxt:

In [907]: timeit data = np.genfromtxt('float.csv')
98.2 ms ± 4.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
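These numbers will vary by machine and NumPy version (loadtxt's parser was reimplemented in C in NumPy 1.23, so the gap may look quite different on a recent install). A rough sketch for repeating the comparison outside IPython, using the standard timeit module; compare_loaders and manual_load are hypothetical names, and the test file mirrors the one created above:

```python
import os
import tempfile
import timeit

import numpy as np


def manual_load(path):
    """The 'manual' approach: split each line, then convert in one go."""
    with open(path) as f:
        return np.array([line.split() for line in f], dtype=float)


def compare_loaders(path, repeats=3):
    """Return the best-of-`repeats` wall time in seconds for each loader."""
    loaders = {
        "loadtxt": lambda: np.loadtxt(path),
        "genfromtxt": lambda: np.genfromtxt(path),
        "manual": lambda: manual_load(path),
    }
    return {
        name: min(timeit.repeat(fn, number=1, repeat=repeats))
        for name, fn in loaders.items()
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "float.csv")
        # Same shape as the test file in this answer.
        np.savetxt(path, np.ones((1000, 100)))
        for name, seconds in compare_loaders(path).items():
            print(f"{name}: {seconds * 1e3:.1f} ms")
```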
hpaulj answered Dec 20 '25 14:12