
Python pandas DataFrame from first and last row of csv

All -

I am looking to create a pandas DataFrame from only the first and last lines of a very large csv. The purpose of this exercise is to be able to easily grab some attributes from the first and last entries in these csv files. I have no problem grabbing the first line of the csv using:

pd.read_csv(filename, nrows=1)

I also have no problem grabbing the last row of a text file in various ways, such as:

with open(filename) as f:
    last_line = f.readlines()[-1]

However, getting these two things into a single DataFrame has thrown me for a loop. Any insight into how best to achieve this goal?

EDIT NOTE: I am trying to achieve this task without loading all of the data into a single DataFrame first as I am dealing with pretty large (>15MM rows) csv files.

Thanks!

wrcobb asked Nov 07 '14 17:11



4 Answers

Just use head and tail and concat. You can even adjust the number of rows.

import pandas as pd

df = pd.read_csv("flu.csv")
top = df.head(1)
bottom = df.tail(1)
concatenated = pd.concat([top,bottom])

print(concatenated)

Result:

           Date  Cases
0      9/1/2014     45
121  12/31/2014     97

Adjusting head and tail to take in 5 rows from top and 10 from bottom...

           Date  Cases
0      9/1/2014     45
1      9/2/2014    104
2      9/3/2014     47
3      9/4/2014    108
4      9/5/2014     49
112  12/22/2014     30
113  12/23/2014     81
114  12/24/2014     99
115  12/25/2014     85
116  12/26/2014     55
117  12/27/2014     91
118  12/28/2014     68
119  12/29/2014    109
120  12/30/2014     55
121  12/31/2014     97
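
The five-and-ten output above comes from changing only the arguments to head and tail. A minimal sketch, using a small synthetic frame in place of flu.csv (the Cases values here are made up for illustration and are not the figures shown above):

```python
import pandas as pd

# Synthetic stand-in for flu.csv: one row per day, 9/1/2014 through 12/31/2014.
# The Cases values are placeholders, not the data from the answer.
dates = pd.date_range("2014-09-01", "2014-12-31", freq="D")
df = pd.DataFrame({
    "Date": [f"{d.month}/{d.day}/{d.year}" for d in dates],
    "Cases": [(i * 7) % 100 for i in range(len(dates))],
})

top = df.head(5)       # first 5 rows
bottom = df.tail(10)   # last 10 rows
concatenated = pd.concat([top, bottom])

print(concatenated)
```

The original row labels are kept, so the index jumps from 4 to 112; call reset_index(drop=True) if a continuous index is preferred.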

If you don't want to load the whole CSV file into a DataFrame, one possible approach is to handle the last line as plain CSV text. The following code is similar to your approach.

import pandas as pd
import csv

top = pd.read_csv("flu.csv", nrows=1)
headers = top.columns.values

# newline="" on the output file is what the csv module expects in Python 3
with open("flu.csv", "r") as f, open("flu2.csv", "w", newline="") as g:
    # readlines() still holds every line in memory as strings, just not as a DataFrame;
    # the plain split(",") assumes no quoted fields that contain commas
    last_line = f.readlines()[-1].strip().split(",")
    c = csv.writer(g)
    c.writerow(headers)
    c.writerow(last_line)

bottom = pd.read_csv("flu2.csv")
concatenated = pd.concat([top, bottom])
concatenated.reset_index(inplace=True, drop=True)

print(concatenated)

The result is the same, except for the index. Tested against a million rows, it was processed in about a second.

        Date  Cases
0   9/1/2014     45
1  7/25/4885     99
[Finished in 0.9s]

How it scales to 15 million rows is the real question for you, so I tested it against exactly 15,728,626 rows, and the results seem good enough.

        Date  Cases
0   9/1/2014     45
1  7/25/4885     99
[Finished in 3.3s]

NullDev answered Oct 07 '22 14:10


To do this without reading the whole file into memory, grab the first line, then iterate through the file to the last line. Then use StringIO to load both lines into pandas. Maybe something like this:

import pandas as pd
from io import StringIO  # Python 3; in Python 2 this lived in the StringIO module

with open('tst.csv') as f:
    first_line = f.readline()
    last_line = first_line  # covers a file with only one line
    for last_line in f:
        pass  # iterate to the end; last_line ends up holding the final line

# DataFrame.append was removed in pandas 2.0, so combine the pieces with concat
mydf = pd.concat([
    pd.read_csv(StringIO(first_line), header=None),
    pd.read_csv(StringIO(last_line), header=None),
], ignore_index=True)
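
The snippet above parses the first physical line as data (header=None). When the file starts with a header row, as in the question, one variant is to read the header separately and hand pandas a three-line CSV. This is a sketch under that assumption; the function name first_and_last_rows is hypothetical, and the demo file contents are made up so the example runs on its own:

```python
import os
import tempfile
from io import StringIO

import pandas as pd


def first_and_last_rows(path):
    """Return a DataFrame holding the first and last data rows of a CSV.

    Assumes the first line is a header and that no quoted field contains
    an embedded newline. A file with a single data row yields that row twice.
    """
    with open(path) as f:
        header = f.readline()
        first = f.readline()
        last = first            # covers a file with a single data row
        for line in f:          # stream through; one line is held at a time
            if line.strip():    # ignore trailing blank lines
                last = line
    if not first.endswith("\n"):
        first += "\n"
    return pd.read_csv(StringIO(header + first + last))


# Demo on a small temporary file (contents are made up for illustration).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "w") as f:
        f.write("Date,Cases\n9/1/2014,45\n9/2/2014,104\n9/3/2014,47\n")
    result = first_and_last_rows(demo_path)

print(result)
```

Because the header travels with the two rows, the resulting columns keep their names and numeric columns keep a numeric dtype.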

JD Long answered Oct 07 '22 14:10


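This is the best solution I found:

import pandas as pd

# count the lines without keeping them all in memory
with open(filename) as f:
    count = sum(1 for _ in f)

# keep line 0 (header), line 1 (first row) and line count-1 (last row)
df = pd.read_csv(filename, skiprows=range(2, count - 1), header=0)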

Stefan Manole answered Oct 07 '22 14:10


You want this answer https://stackoverflow.com/a/18603065/4226476 - not the accepted answer but the best because it seeks backwards for the first newline instead of guessing.

Then wrap the two lines in a StringIO:

from io import StringIO  # Python 3 replacement for cStringIO
import pandas as pd

# grab the lines as per first-and-last-line question
truncated_input = StringIO(the_two_lines)
truncated_input.seek(0)  # harmless here; only required after writing to the buffer
df = pd.read_csv(truncated_input)
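
As a rough illustration of the seek-backwards idea (not a copy of the linked answer), the sketch below opens the file in binary mode, steps back from the end until it finds the newline that precedes the last line, and then feeds the header, first row, and last row to pandas through StringIO. The helper names read_last_line and head_and_tail_frame are hypothetical, and the code assumes a header row, UTF-8 text, and no newlines embedded in quoted fields:

```python
import os
import tempfile
from io import StringIO

import pandas as pd


def read_last_line(path, chunk_size=4096):
    """Return the last non-empty line of a file without reading all of it."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell()
        tail = b""
        while pos > 0:
            step = min(chunk_size, pos)
            pos -= step
            f.seek(pos)
            tail = f.read(step) + tail
            # Drop trailing newlines, then look for the newline before the last line.
            stripped = tail.rstrip(b"\r\n")
            cut = stripped.rfind(b"\n")
            if cut != -1:
                return stripped[cut + 1:].decode("utf-8")
        # No earlier newline found: the file is a single line (or empty).
        return tail.rstrip(b"\r\n").decode("utf-8")


def head_and_tail_frame(path):
    """Build a DataFrame from the header, first data row, and last row."""
    with open(path, encoding="utf-8") as f:
        header = f.readline().rstrip("\r\n")
        first = f.readline().rstrip("\r\n")
    last = read_last_line(path)
    return pd.read_csv(StringIO("\n".join([header, first, last])))


# Demo on a small temporary file (contents are made up for illustration).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("Date,Cases\n9/1/2014,45\n9/2/2014,104\n12/31/2014,97\n")
    df = head_and_tail_frame(demo_path)

print(df)
```

Only the last chunk or two of the file are read to find the final line, so the cost of this step does not grow with the number of rows.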

allen-smithee answered Oct 07 '22 14:10