Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficiently Read last 'n' rows of CSV into DataFrame

Tags:

A few methods to do this:

  1. Read the entire CSV and then use df.tail
  2. Somehow reverse the file (whats the best way to do this for large files?) and then use nrows argument to read
  3. Somehow find the number of rows in the CSV, then use skiprows and read required number of rows.
  4. Maybe do chunk read discarding initial chunks (though not sure how this would work)

Can it be done in some easier way? If not, which amongst these three should be prefered and why?

Possibly related:

  1. Efficiently finding the last line in a text file
  2. Reading parts of ~13000 row CSV file with pandas read_csv and nrows

Not directly related:

  1. How to get the last n row of pandas dataframe?
like image 700
Nipun Batra Avatar asked Jun 14 '13 12:06

Nipun Batra


People also ask

How do I get the last 5 rows of a data frame?

Method 1: Using tail() method DataFrame. tail(n) to get the last n rows of the DataFrame. It takes one optional argument n (number of rows you want to get from the end). By default n = 5, it return the last 5 rows if the value of n is not passed to the method.

Which method is DataFrame reads last n rows from DataFrame?

pandas.DataFrame.tail() In Python's Pandas module, the Dataframe class provides a tail() function to fetch bottom rows from a Dataframe i.e. It returns the last n rows from a dataframe.

How do I drop the last n rows of a data frame?

We can remove the last n rows using the drop() method. drop() method gets an inplace argument which takes a boolean value. If inplace attribute is set to True then the dataframe gets updated with the new value of dataframe (dataframe with last n rows removed).


2 Answers

I don't think pandas offers a way to do this in read_csv.

Perhaps the neatest (in one pass) is to use collections.deque:

from collections import deque from StringIO import StringIO  with open(fname, 'r') as f:     q = deque(f, 2)  # replace 2 with n (lines read at the end)  In [12]: q Out[12]: deque(['7,8,9\n', '10,11,12'], maxlen=2)          # these are the last two lines of my csv  In [13]: pd.read_csv(StringIO(''.join(q)), header=None) 

Another option worth trying is to get the number of lines in a first pass and then read the file again, skip that number of rows (minus n) using read_csv...

like image 123
Andy Hayden Avatar answered Sep 29 '22 13:09

Andy Hayden


Here's a handy way to do. Works well for what I like to do -

import tailer import pandas as pd import io  with open(filename) as file:     last_lines = tailer.tail(file, 15)  df = pd.read_csv(io.StringIO('\n'.join(last_lines)), header=None) 

You need to install tailer, to have this working:

pip install --user tailer 
like image 27
Parikshit Bhinde Avatar answered Sep 29 '22 13:09

Parikshit Bhinde