All -
I am looking to create a pandas DataFrame from only the first and last lines of a very large csv. The purpose of this exercise is to be able to easily grab some attributes from the first and last entries in these csv files. I have no problem grabbing the first line of the csv using:
pd.read_csv(filename, nrows=1)
I also have no problem grabbing the last row of a text file in various ways, such as:
with open(filename) as f:
    last_line = f.readlines()[-1]
However, getting these two things into a single DataFrame has thrown me for a loop. Any insight into how best to achieve this goal?
EDIT NOTE: I am trying to achieve this task without loading all of the data into a single DataFrame first as I am dealing with pretty large (>15MM rows) csv files.
Thanks!
To read the first n lines of a file, you can use the pandas call pd.read_csv(filename, nrows=n).
You can get the first row with iloc[0] and the last row with iloc[-1]. To get the value of a single element, use iloc[0]['column_name'] or iloc[-1]['column_name'].
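For example (data.csv and its contents are made up here to stand in for the real file):

```python
import pandas as pd

# Hypothetical small CSV standing in for the real file.
with open("data.csv", "w") as f:
    f.write("Date,Cases\n9/1/2014,45\n9/2/2014,104\n12/31/2014,97\n")

first = pd.read_csv("data.csv", nrows=1)   # only the first data row
print(first.iloc[0]["Cases"])              # -> 45

df = pd.read_csv("data.csv")
print(df.iloc[-1]["Cases"])                # -> 97
```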
Just use head, tail, and concat. You can even adjust the number of rows.
import pandas as pd

df = pd.read_csv("flu.csv")
top = df.head(1)
bottom = df.tail(1)
concatenated = pd.concat([top, bottom])
print(concatenated)
Result:
Date Cases
0 9/1/2014 45
121 12/31/2014 97
Adjusting head and tail to take in 5 rows from the top and 10 from the bottom...
Date Cases
0 9/1/2014 45
1 9/2/2014 104
2 9/3/2014 47
3 9/4/2014 108
4 9/5/2014 49
112 12/22/2014 30
113 12/23/2014 81
114 12/24/2014 99
115 12/25/2014 85
116 12/26/2014 55
117 12/27/2014 91
118 12/28/2014 68
119 12/29/2014 109
120 12/30/2014 55
121 12/31/2014 97
One possible approach, if you don't want to load the whole file as a DataFrame, is to process it as a plain CSV. The following code is similar to your approach.
import pandas as pd
import csv

# Read the header and first data row with pandas.
top = pd.read_csv("flu.csv", nrows=1)
headers = top.columns.values

# Write the last line, preceded by the header, to a small temporary CSV.
with open("flu.csv", "r") as f, open("flu2.csv", "w") as g:
    last_line = f.readlines()[-1].strip().split(",")
    c = csv.writer(g)
    c.writerow(headers)
    c.writerow(last_line)

bottom = pd.read_csv("flu2.csv")
concatenated = pd.concat([top, bottom])
concatenated.reset_index(inplace=True, drop=True)
print(concatenated)
The result is the same, except for the index. Tested against a million rows, and it was processed in about a second.
Date Cases
0 9/1/2014 45
1 7/25/4885 99
[Finished in 0.9s]
How it scales to 15 million rows was the open question, so I tested it against exactly 15,728,626 rows and the results look good enough.
Date Cases
0 9/1/2014 45
1 7/25/4885 99
[Finished in 3.3s]
So the way to do this without reading the whole file into Python first is to grab the first line, then iterate through the file to the last line. Then use StringIO to feed both into pandas. Maybe something like this:
import io
import pandas as pd

with open('tst.csv') as f:
    first_line = f.readline()
    for line in f:
        pass  # iterate to the end of the file
    last_line = line

# Parse each line separately, then stack the two one-row frames.
mydf = pd.concat([
    pd.read_csv(io.StringIO(first_line), header=None),
    pd.read_csv(io.StringIO(last_line), header=None),
])
This is the best solution I found:
import pandas as pd

# Count the lines without holding them all in memory,
# then skip everything between the first and last data rows.
with open(filename) as f:
    count = sum(1 for _ in f)
df = pd.read_csv(filename, skiprows=range(2, count - 1), header=0)
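As a sanity check, here is that skiprows trick run on a small generated file (demo.csv and its contents are made up for illustration):

```python
import pandas as pd

# Build a hypothetical demo.csv: header plus five data rows.
with open("demo.csv", "w") as f:
    f.write("Date,Cases\n")
    for day, cases in enumerate([45, 104, 47, 108, 97], start=1):
        f.write("9/%d/2014,%d" % (day, cases) + "\n")

with open("demo.csv") as f:
    count = sum(1 for _ in f)  # 6 lines: header + 5 data rows

# Skip every data row except the first and the last.
df = pd.read_csv("demo.csv", skiprows=range(2, count - 1), header=0)
print(df["Cases"].tolist())  # [45, 97]
```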
You want this answer https://stackoverflow.com/a/18603065/4226476 - not the accepted one, but the best, because it seeks backwards from the end of the file for the last newline instead of guessing.
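The idea from that linked answer can be sketched as a small helper (last_line is a hypothetical name; this sketch assumes the file is at least two bytes long):

```python
import os

def last_line(filename):
    """Read only the final line by seeking backwards from the end of the file."""
    with open(filename, "rb") as f:
        f.seek(-2, os.SEEK_END)          # step over the trailing newline, if any
        while f.read(1) != b"\n":        # walk backwards to the previous newline
            try:
                f.seek(-2, os.SEEK_CUR)
            except OSError:              # hit the start: the file has one line
                f.seek(0)
                break
        return f.readline().decode()
```

This reads only a handful of bytes regardless of file size, which is the point for a 15MM-row CSV.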
Then wrap the two lines in a StringIO:
from io import StringIO
import pandas as pd

# grab the lines as per the first-and-last-line question
truncated_input = StringIO(the_two_lines)
df = pd.read_csv(truncated_input)