When comparing these two ways of doing the same thing:
import numpy as np
import time

start_time = time.time()
for j in range(1000):
    bv = np.loadtxt('file%d.dat' % (j + 1))
    if j % 100 == 0:
        print(bv[300, 0])
T1 = time.time() - start_time
print("--- %s seconds ---" % T1)
and
import numpy as np
import time

start_time = time.time()
for j in range(1000):
    a = open('file%d.dat' % (j + 1), 'r')
    b = a.readlines()
    a.close()
    for i in range(len(b)):
        b[i] = b[i].strip("\n")
        b[i] = b[i].split("\t")
        b[i] = [float(x) for x in b[i]]  # map() alone would leave an iterator in Python 3
    bv = np.asarray(b)
    if j % 100 == 0:
        print(bv[300, 0])
T1 = time.time() - start_time
print("--- %s seconds ---" % T1)
I have noticed that the second one is way faster. Is there any way to get something as concise as the first method and as fast as the second one? And why is loadtxt so slow compared with doing the same task manually?
With a simple, not-too-large CSV file created with:
In [898]: arr = np.ones((1000,100))
In [899]: np.savetxt('float.csv',arr)
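(A quick sanity check, not part of the original session: np.savetxt's default format is '%.18e' with a single-space delimiter, so each of the 1000 lines holds 100 space-separated float fields.)

# inspect the layout np.savetxt produced (default fmt='%.18e', delimiter=' ')
with open('float.csv') as f:
    first = f.readline()
print(len(first.split()))  # 100 fields per line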
the loadtxt version:
In [900]: timeit data = np.loadtxt('float.csv')
112 ms ± 119 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
np.fromfile can also load text, though it doesn't preserve any shape information (and shows no apparent speed advantage here):
In [901]: timeit data = np.fromfile('float.csv', dtype=float, sep=' ').reshape(-1,100)
129 ms ± 1.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
the most concise version of the 'manual' approach that I can come up with:
In [902]: %%timeit
...: with open('float.csv') as f:
   ...:     data = np.array([line.strip().split() for line in f], float)
52.9 ms ± 589 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
This roughly 2x improvement over loadtxt seems typical of variations on this approach.
pd.read_csv takes about the same time.
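For reference, a minimal sketch of the pandas version (the sep=' ' and header=None arguments are assumptions, chosen to match the space-separated, headerless file np.savetxt wrote):

import pandas as pd

# assumed call: space-delimited, no header row; .values gives the ndarray
data = pd.read_csv('float.csv', sep=' ', header=None).values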
genfromtxt is a bit faster than loadtxt:
In [907]: timeit data = np.genfromtxt('float.csv')
98.2 ms ± 4.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
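As for why loadtxt is so slow: in NumPy versions of this era it parses the file line by line in pure Python, with general-purpose machinery for comments, converters, skipped rows and so on, and for a plain all-float file most of that generality is overhead the handwritten loop avoids. If you want loadtxt's conciseness with the manual loop's speed, one option is to wrap the fast version in a small helper (a hypothetical convenience function, not part of the original answer):

import numpy as np

def load_floats(fname):
    # concise wrapper around the fast 'manual' approach timed above;
    # assumes whitespace-separated float columns, no comments or header
    with open(fname) as f:
        return np.array([line.split() for line in f], dtype=float)

data = load_floats('float.csv')  # same ~2x speedup over np.loadtxt on this file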