
How to read a specific line number in a csv with pandas

I have a huge dataset and I am trying to read it line by line. For now, I am reading the dataset using pandas:

df = pd.read_csv("mydata.csv", sep =',', nrows = 1)

This function allows me to read only the first line, but how can I read the second, the third one and so on? (I would like to use pandas.)

EDIT: To make it more clear, I need to read one line at a time as the dataset is 20 GB and I cannot keep all the stuff in memory.

Guido Muscioni asked Dec 01 '17 04:12


2 Answers

One way could be to read the file in parts and store each part, for example:

df1 = pd.read_csv("mydata.csv", nrows=10000)

The next call skips the 10000 rows already read and stored in df1, and stores the following 10000 rows in df2.

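# skip file lines 1..10000 (the rows already in df1); line 0, the header, is kept
df2 = pd.read_csv("mydata.csv", skiprows=range(1, 10001), nrows=10000)
# the n-th part, counting from n = 1 (assumes the file has a header row)
dfn = pd.read_csv("mydata.csv", skiprows=range(1, (n-1)*10000 + 1), nrows=10000)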

Maybe there is a way to introduce this idea into a for or while loop.
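One way to put this into a loop, as a minimal sketch: read_csv accepts a chunksize argument and then returns an iterator of DataFrames instead of one big DataFrame, so only one part is in memory at a time. The tiny temporary file, the column names a and b, and the chunk size of 10 are stand-ins invented for the example so that it runs on its own; with the real file you would pass "mydata.csv" and something like chunksize=10000.

```python
import os
import tempfile

import pandas as pd

# Build a small stand-in CSV so the sketch is self-contained (hypothetical data).
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "mydata.csv")
pd.DataFrame({"a": range(25), "b": range(25, 50)}).to_csv(csv_path, index=False)

chunk_lengths = []
total_a = 0

# With chunksize set, read_csv yields DataFrames of at most that many rows.
# Each chunk keeps the header's column names.
for chunk in pd.read_csv(csv_path, chunksize=10):
    chunk_lengths.append(len(chunk))   # process the part here
    total_a += int(chunk["a"].sum())   # e.g. accumulate a running result

print(chunk_lengths)
print(total_a)
```

Setting chunksize=1 would give one row per iteration, which is probably closest to what the question asks for, though larger chunks are usually much faster.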

Davidvs answered Sep 21 '22 22:09


Looking in the pandas documentation, there is a parameter of the read_csv function that helps here:

skiprows

If a list is assigned to this parameter, pandas skips the lines whose 0-based indices appear in the list:

skiprows = [0,1]

This will skip the first and second lines of the file (line 0 is the header row, if the file has one). Thus a combination of nrows and skiprows allows each line in the dataset to be read separately.
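As a rough sketch of that combination, the helper below reads a single data row by position. read_row is a hypothetical name, not a pandas function, and the small temporary file with columns a and b is made up so the example runs on its own. It assumes the file has a header on line 0, which is why the skipped range starts at 1.

```python
import os
import tempfile

import pandas as pd

# Build a small stand-in CSV so the sketch is self-contained (hypothetical data).
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "mydata.csv")
pd.DataFrame({"a": range(5), "b": list("vwxyz")}).to_csv(csv_path, index=False)


def read_row(path, k):
    """Return data row k (0-based, header not counted) as a one-row DataFrame."""
    # File line 0 is the header, so data row k sits on file line k + 1.
    # Skipping file lines 1..k keeps the header and drops the rows before row k.
    return pd.read_csv(path, skiprows=range(1, k + 1), nrows=1)


row = read_row(csv_path, 2)
print(row)
```

Each call still has to scan the file from the beginning to find the requested line, so this suits picking out a few specific rows better than walking through all of a 20 GB file one row at a time.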

Guido Muscioni answered Sep 17 '22 22:09