Is there a quick way to read the last N lines of a CSV file in Python, using numpy or pandas?
I cannot use skip_header in numpy or skiprows in pandas because the length of the file varies, and I always need the last N rows.
I know I can use pure Python to read line by line from the last row of the file, but that would be very slow. I can do that if I have to, but a more efficient way with numpy or pandas (which essentially use C) would be really appreciated.
With a small 10-line test file I tried two approaches: parse the whole thing and select the last N lines, versus load all lines but parse only the last N:
In [1025]: timeit np.genfromtxt('stack38704949.txt',delimiter=',')[-5:]
1000 loops, best of 3: 741 µs per loop
In [1026]: %%timeit
      ...: with open('stack38704949.txt','rb') as f:
      ...:     lines = f.readlines()
      ...: np.genfromtxt(lines[-5:],delimiter=',')
1000 loops, best of 3: 378 µs per loop
This was tagged as a duplicate of Efficiently Read last 'n' rows of CSV into DataFrame. The accepted answer there used from collections import deque and collected the last N lines in that structure. It also used StringIO to feed the lines to the parser, which is an unnecessary complication. genfromtxt takes input from anything that gives it lines, so a list of lines is just fine.
In [1031]: %%timeit
      ...: with open('stack38704949.txt','rb') as f:
      ...:     lines = deque(f,5)
      ...: np.genfromtxt(lines,delimiter=',')
1000 loops, best of 3: 382 µs per loop
Basically the same time as readlines and slice.
deque may have an advantage when the file is very large, and it gets costly to hang onto all the lines. I don't think it saves any file reading time. Lines still have to be read one by one.
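As a minimal sketch, the deque approach can be wrapped in a small helper. The function name read_last_rows and its parameters are hypothetical, chosen here for illustration; the file is opened in text mode, and the deque is converted to a list before being handed to genfromtxt.

```python
from collections import deque

import numpy as np


def read_last_rows(path, n, delimiter=","):
    """Parse only the last n lines of a delimited text file (hypothetical helper)."""
    with open(path, "r") as f:
        # A deque with maxlen=n drops older lines as newer ones arrive,
        # so at most n lines are held in memory at any time.
        tail = deque(f, maxlen=n)
    # genfromtxt accepts a list of lines, so no StringIO wrapper is needed.
    return np.genfromtxt(list(tail), delimiter=delimiter)
```

The whole file is still read once, line by line; only the parsing and the memory held are limited to the last n lines.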
Timings for the row_count followed by skip_header approach are slower; it requires reading the file twice. skip_header still has to read lines.
In [1046]: %%timeit
      ...: with open('stack38704949.txt',"r") as f:
      ...:     reader = csv.reader(f,delimiter = ",")
      ...:     data = list(reader)
      ...:     row_count = len(data)
      ...: np.genfromtxt('stack38704949.txt',skip_header=row_count-5,delimiter=',')
The slowest run took 5.96 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 760 µs per loop
For purposes of counting lines we don't need to use csv.reader, though it doesn't appear to cost much extra time.
In [1048]: %%timeit
      ...: with open('stack38704949.txt',"r") as f:
      ...:     lines = f.readlines()
      ...:     row_count = len(lines)
      ...: np.genfromtxt('stack38704949.txt',skip_header=row_count-5,delimiter=',')
1000 loops, best of 3: 736 µs per loop
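If the two-pass approach is still wanted, the count can be done without keeping the lines around at all. The following is only a sketch under that assumption; the name read_last_rows_two_pass is hypothetical, and it was not part of the timings above.

```python
import numpy as np


def read_last_rows_two_pass(path, n, delimiter=","):
    """Count lines first, then let genfromtxt skip all but the last n (hypothetical helper)."""
    # First pass: count lines without storing them.
    with open(path, "r") as f:
        row_count = sum(1 for _ in f)
    # Guard against files shorter than n lines.
    skip = max(row_count - n, 0)
    # Second pass: genfromtxt reads and discards the first `skip` lines.
    return np.genfromtxt(path, skip_header=skip, delimiter=delimiter)
```

This still reads the file twice, so it should not be expected to beat the readlines or deque versions; it only avoids holding every line in memory during the count.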