How can I partially read a huge CSV file?

Tags: python, pandas

I have a very large CSV file that I cannot read into memory all at once. I only want to read and process a few lines of it, so I am looking for a function in pandas that can handle this task. Plain Python handles it well:

with open('abc.csv') as f:
    line = f.readline()
    # pass until it reaches a particular line number....

However, if I do this in pandas, I always read the first line:

datainput1 = pd.read_csv('matrix.txt', sep=',', header=None, nrows=1)
datainput2 = pd.read_csv('matrix.txt', sep=',', header=None, nrows=1)

I am looking for an easier way to handle this task in pandas. For example, how can I quickly read only rows 1000 to 2000?

I want to use pandas because I want to read the data into a DataFrame.

lserlohn asked Mar 29 '15 20:03


2 Answers

Use chunksize:

for df in pd.read_csv('matrix.txt', sep=',', header=None, chunksize=1):
    print(df)  # do something with each one-row DataFrame

To answer your second part do this:

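reader = pd.read_csv('matrix.txt', sep=',', header=None, skiprows=1000, chunksize=1000)
df = next(reader)  # first chunk: the 1000 rows after the skipped ones

This skips the first 1000 rows. Because chunksize makes read_csv return an iterator rather than a DataFrame, take the first chunk to get the next 1000 rows, i.e. rows 1000-1999 counting from zero. It's unclear whether you need the end points included, but you can adjust the numbers to get what you want.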
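
If the goal is to process the whole file without holding it in memory, the chunked loop can carry a running result across chunks. Below is a minimal sketch, not a definitive implementation: the in-memory `csv_text` is a made-up stand-in for 'matrix.txt', and `sum_first_column` is a hypothetical helper name chosen for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for 'matrix.txt': 10 rows, 3 integer columns.
csv_text = "\n".join(f"{i},{i * 2},{i * 3}" for i in range(10))


def sum_first_column(source, chunksize=4):
    """Stream the CSV in chunks and accumulate a total over column 0."""
    total = 0
    n_rows = 0
    # With chunksize set, read_csv yields one DataFrame per chunk,
    # so only `chunksize` rows are held in memory at a time.
    for chunk in pd.read_csv(source, sep=',', header=None, chunksize=chunksize):
        total += int(chunk[0].sum())
        n_rows += len(chunk)
    return total, n_rows


total, n_rows = sum_first_column(io.StringIO(csv_text))
```

With a real file you would pass the path instead of the StringIO object and pick a chunksize that fits comfortably in memory.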

EdChum answered Sep 23 '22 09:09


In addition to EdChum's answer, I find the nrows argument useful: it simply sets the number of rows you want to import. That way you don't get an iterator; you just import a part of the file that is nrows rows long. It works with skiprows too.

df = pd.read_csv('matrix.txt', sep=',', header=None, skiprows=1000, nrows=1000)
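
To make the row arithmetic explicit, the skiprows/nrows combination can be wrapped in a small helper. This is only a sketch under some assumptions: `read_rows` and `read_rows_with_header` are hypothetical names, rows are counted from zero, and the in-memory CSV text stands in for the real file. The second variant assumes the file has a header line, in which case passing a range to skiprows keeps line 0 while skipping the data rows before `start`.

```python
import io

import pandas as pd


def read_rows(source, start, stop):
    """Return rows start..stop-1 (zero-based) of a headerless CSV."""
    return pd.read_csv(source, sep=',', header=None,
                       skiprows=start, nrows=stop - start)


def read_rows_with_header(source, start, stop):
    """Same slice of data rows, but keep the header on line 0."""
    # Skip file lines 1..start, i.e. data rows 0..start-1.
    return pd.read_csv(source, sep=',',
                       skiprows=range(1, start + 1), nrows=stop - start)


# Hypothetical stand-ins for 'matrix.txt': 3000 data rows, 2 columns.
plain_csv = "\n".join(f"{i},{i * 10}" for i in range(3000))
header_csv = "a,b\n" + plain_csv

df = read_rows(io.StringIO(plain_csv), 1000, 2000)
df_h = read_rows_with_header(io.StringIO(header_csv), 1000, 2000)
```

Note that the returned DataFrame's index restarts at 0, so the original row numbers are not preserved in the index.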
petezurich answered Sep 25 '22 09:09